# Transformers

In this notebook, we are going to define and implement a Transformer model.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

## Scaled Dot-Product Attention

<img src='assets/1/scaled-dot-product-3.PNG'/> 

In [3]:
# Scaled Dot-Product Attention calculation 
# from the paper (section 3.2.1 Scaled Dot-Product Attention):
# 'We compute the dot products of the query with all keys, divide each by √dk, 
# and apply a softmax function to obtain the weights on the values.'

class ScaledDotProductAttention(nn.Module):
    def __init__(self, mask=False):
        super(ScaledDotProductAttention, self).__init__()
        self.mask=mask
        
    def forward(self, Q, K, V):
        
        # compute dot product of the all query with all keys
        dot_products = torch.matmul(Q, torch.transpose(K, -2, -1)) # transpose on last two dimensions
        
        # divide each by √dk
        d_k = K.shape[-1]    # get length of key vector
        scaled_dot_products = dot_products / np.sqrt(d_k)
        
        # apply a softmax function to obtain weights on values
        weights = F.softmax(scaled_dot_products, dim=-1)
        
        # get weighted values by multiplying values with softmax scores
        weighted_values = torch.matmul(weights, V)
        
        # apply mask to prevent positions from attending to subsequent positions in decoder
        if self.mask==True:
            size = weighted_values.shape
            look_ahead_mask = torch.triu(torch.full(size, float('-inf')), diagonal=1)
            look_ahead_mask[look_ahead_mask == 0] = 1
            weighted_values = weighted_values * look_ahead_mask
        
        return weighted_values


Scaled Dot-product attention is identical to the [Dot-product attention algorithm](https://arxiv.org/pdf/1508.04025.pdf), except for the scaling  factor of 1/√dk.
>For large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√dk. <sub>-from section 3.2.1 Scaled Dot-Product Attention<sub>

**Masked Multi-Head Attention**

<img src='assets/1/masked-attention.PNG' width=50% height=50% /> 

>We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We **implement this inside of scaled dot-product attention** by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. <sub>-from section 3.2.3 Applications of Attention in our Model<sub>

>We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i. <sub>-from section 3.1 Encoder and Decoder Stacks<sub>
    
 ---

Let's do a unit test on Scaled Dot-Product Attention implementation.

<img src='assets/1/scaled-dot-product-matrix.PNG' /> 

In [35]:
# define data, n=3 embeddings(sequence length), ex: [Je, suis, etudiant]
n = 3

# from section 3.2.2 Multi-Head Attention
# d_key = d_value = d_query
d_key = 64 # key dimensionality

# define K,Q and V matrices
K = torch.randn(n, d_key)
Q = torch.randn(n, d_key)
V = torch.randn(n, d_key)

scaled_dot_product = ScaledDotProductAttention()

# apply scaled-dot product attention
attention_scores = scaled_dot_product(Q, K, V)    

attention_scores.shape

torch.Size([3, 64])

In [None]:
# unit test for scaled dot-prduct attention 

# each attention distribution should sum up to one

print(torch.sum(attention_scores, dim=-1))

In [5]:
scaled_dot_product = ScaledDotProductAttention(mask=True)

masked_output = scaled_dot_product(Q, K, V)
print(masked_output.shape)

print(masked_output[:, :5])

torch.Size([3, 64])
tensor([[-0.1329,     inf,    -inf,     inf,    -inf],
        [-0.3348, -1.2286,    -inf,     inf,    -inf],
        [-0.3494, -1.1837,  0.2570,     inf,    -inf]])


## Attention

<img src='assets/1/attention-3.PNG'/> 

In [6]:
class Attention(nn.Module):
    def __init__(self, d_model=512, d_key=64, mask=False):
        super(Attention, self).__init__()
        
        # define Key, Query and Value weight matrices
        self.WK = nn.Parameter(torch.randn(d_model, d_key))
        self.WQ = nn.Parameter(torch.randn(d_model, d_key))
        self.WV = nn.Parameter(torch.randn(d_model, d_key))
        
        # init scaled dot-product attention
        self.scaled_dot_product = ScaledDotProductAttention(mask)
        
    def forward(self, inputs, encoder_output=None):
        WK, WQ, WV = self.WK, self.WQ, self.WV
        X = inputs if encoder_output is None else encoder_output
        
        # apply linear transformations
        
        # get packed Key, Query and Value matrices
        # by multiplying inputs with K, Q and V weight matrices
        K = torch.matmul(X, WK)
        V = torch.matmul(X, WV)
        Q = torch.matmul(inputs, WQ)
            
        # apply scaled dot-product attention
        attention_scores = self.scaled_dot_product(Q, K, V)
        
        return attention_scores

>An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. <sub>-from section 3.2 Attention<sub>

>In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. <sub>-from section 3.2.1 Scaled Dot-Product Attention<sub>

It can be inferred from the paper that K, Q and V weight matrices are distinct but in practice, in implementation, there is only **one weight matrix** and projected matrices are also one matrix. The reason for this is to implement the model with highly optimized matrix multiplication code.     
We will not implement this way in this notebook because we want the matrices to be distinguishable so that the code could be easier to follow.

<img src='assets/1/weight-impl.PNG' />

In decoder attention layer, we get the linearly projected K and V matrices by considering encoder output.
<img src='assets/1/encoder-out.PNG' width=50% height=50% align='left'/>

Let's do a unit test on Attention implementation.

<img src='assets/1/attention-matrix.PNG'/>

In [7]:
# from section 3.1 Encoder and Decoder Stacks
d_model = 512    # embedding dimensionsionality

b = 2   # batch size

# define input
# b, n, d_model -> batch size, sequence length(number of embeddings), embedding dimensionality
X = torch.randn(b, n, d_model)

# init attention 
attention = Attention(d_model, d_key)

# compute attention scores
attention_scores = attention(X)

attention_scores.shape

torch.Size([2, 3, 64])

## Multi-Head Attention

<img src='assets/1/multi-head-attention-3.PNG' width=75% height=75%/>

In [8]:
# Multi-Head Attention consists of several attention layers running in parallel.       
class MultiHeadAttention(nn.Module):
    def __init__(self, n_head=8, d_model=512, mask=False):
        super(MultiHeadAttention, self).__init__()
        self.mask = mask
        
        # number of heads
        self.h = n_head
        
        # dimensionality of key vector
        assert d_model % n_head == 0
        d_key = int(d_model / n_head)
        
        # add attention layers to a ModuleList container
        attention_list = [Attention(d_model, d_key, mask) for _ in range(n_head)]
        self.multi_head_attention = nn.ModuleList(attention_list)
        
        # define linear layer 
        self.W = nn.Parameter(torch.randn(n_head*d_key, d_model))
    
    def forward(self, x, encoder_output=None):
        # apply & concat attention layers
        attention_scores = [attention(x, encoder_output) for attention in self.multi_head_attention]
        Z = torch.cat(attention_scores, -1)
        
        # apply linear transformation
        output = torch.matmul(Z, self.W)    # reduce dimensionality
        
        return output
    
# In this work we employ h = 8 parallel attention layers, or heads. 
# For each of these we use d_k = d_v = d_model/h = 64
# - from section 3.2.2 Multi-Head Attention
# We can conclude that d_key is not a hyperparameter
# and that d_model should be divisible by the number of heads.        

Let's do a unit test on Multi-Head Attention implementation.

<img src='assets/1/multi-head-matrix.PNG'/>

In [9]:
# from section 3.2.2 Multi-Head Attention
h = 8 # number of heads

# define input
X = torch.randn(b, n, d_model)

# init multi-head attention 
multi_head_attention = MultiHeadAttention(h, d_model)

# compute multi_head attention score
Z = multi_head_attention(X)

Z.shape

torch.Size([2, 3, 512])

**Parallelization of Multi-Head Attention in Code**

Multi-Head attention works with Attention layers running in parallel. However, currently there does  not exist an official implementation in PyTorch for parallel modules. Developers are using third-party libraries in their code to utilize parallelism. <br>
Executing the code on GPU avoids sequential execution since underlying execution is asynchronous. Additionally, **if we have multiple GPUs in our system (and don’t use data parallel), we could execute different modules on each device and concatenate the result back on a single device.**

## Feed Forward Neural Network
<img src='assets/1/ffnn.PNG' width=75% height=75%/>


In [10]:
# The dimensionality of input and output is dmodel = 512, 
# and the inner-layer has dimensionality d_ff = 2048
# - from section 3.3 Position-wise Feed-Forward Networks

class FeedForwardNetwork(nn.Module):
    def __init__(self, d_model=512, d_feedforward=2048):
        super(FeedForwardNetwork, self).__init__()
        
        # define the feedforward neural network 
        self.feedforwardnn = nn.Sequential(   
                nn.Linear(d_model, d_feedforward),
                nn.ReLU(),
                nn.Linear(d_feedforward, d_model),
        )
        
    def forward(self, x):
        x = self.feedforwardnn(x)
        return x

In [11]:
# unit test of FeedForward network

# from section 3.3 Position-wise Feed-Forward Network
d_feedforward = 2048

# define input
X = torch.rand(b, n, d_model)

# define feedforward neural network
ffnn = FeedForwardNetwork(d_model, d_feedforward)

# compute ffnn output
output = ffnn(X)

output.shape

torch.Size([2, 3, 512])

## Sublayer


<img src='assets/1/sublayer.PNG' width=80% height=80%/> 

In [12]:
# We employ a residual connection around each of the two sub-layers, 
# followed by layer normalization. That is'The output of each sub-layer is
# LayerNorm(x + Sublayer(x)), where Sublayer(x) is
# the function implemented by the sub-layer itself.'
# from section - 3.1 Encoder and Decoder Stacks

# define residual learning block
class Residual_Connection(nn.Module):
    def __init__(self, layer):
        super(Residual_Connection, self).__init__()
        self.layer = layer     # function implemented by the layer

    def forward(self, x):
        f_x = self.layer(x)                    # apply Sublayer(x)
        x = x + f_x                            # x + Sublayer(x)
        return x
    
# To facilitate these residual connections, all sub-layers in the model, as well as the embedding
# layers, produce outputs of dimension dmodel = 512
class Sublayer(nn.Module):
    def __init__(self, layer, d_model=512):
        super(Sublayer, self).__init__()
    
        self.res_connect = Residual_Connection(layer) 
        self.layer_norm = nn.LayerNorm(d_model)  # embedding vector length
        
    def forward(self, x):
        x = self.res_connect(x)
        x = self.layer_norm(x)
        return x

In [13]:
# unit test on sublayer

# define input
X = torch.rand(b, n, d_model)

# init feedforward neural network (with default settings) sublayer
nn_layer = Sublayer(FeedForwardNetwork())

# compute ffnn sublayer output
ffnn_output = nn_layer(X)

# init attention sublayer (with default settings)
attention_layer = Sublayer(MultiHeadAttention())

# compute attention sublayer output
attention_output = attention_layer(X)

print(ffnn_output.shape)
print(attention_output.shape)

torch.Size([2, 3, 512])
torch.Size([2, 3, 512])


## Encoder

<img src='assets/1/encoder.PNG' /> 

In [14]:
class EncoderwithSublayer(nn.Module):
    def __init__(self, n_head=8, d_model=512, d_feedforward=2048):
        super(Encoder, self).__init__()
        
        self.attention = MultiHeadAttention(n_head, d_model)
        self.feed_forward = FeedForwardNetwork(d_model, d_feedforward)
        
        # two sub-layers
        # 1. Self-Attention
        self.attention_layer = Sublayer(self.attention, d_model)
          
        # 2. Feedforward Neural Network
        self.feedforward_layer = Sublayer(self.feed_forward, d_model)
        
    def forward(self, x):
        # apply attention layer
        x = self.attention_layer(x)
        # apply feedforward network layer
        x = self.feedforward_layer(x)
        
        return x

In [15]:
# unit test on encoder

# define input
X = torch.rand(b, n, d_model)

# init encoder
encoder = EncoderWithSublayer(n_head=8, d_model=512, d_feedforward=2048)

# compute encoder output
output = encoder(X)

output.shape

torch.Size([2, 3, 512])

We will not utilize Sublayer module implemented above to be able to observe the flow in Encoder & Decoder. Apart from that, we will not add any dropout to residual parts since we will not do any training in this notebook.

In [18]:
class Encoder(nn.Module):
    def __init__(self, n_head=8, d_model=512, d_feedforward=2048):
        super(Encoder, self).__init__()
        
        self.attention = MultiHeadAttention(n_head, d_model)
        self.feed_forward = FeedForwardNetwork(d_model, d_feedforward)
        
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)
        
    def forward(self, x):
        # pass through attention layer
        attention_scores = self.attention(x)
        # apply residual connection & layer normalization
        x = self.layer_norm1(attention_scores + x)  
        
        # pass through feedforward network layer
        scores = self.feed_forward(x)
        # apply residual connection & layer normalization
        x = self.layer_norm2(scores + x)
        
        return x

In [19]:
# unit test on encoder

# define input
X = torch.rand(b, n, d_model)

# init encoder
encoder = Encoder(n_head=8, d_model=512, d_feedforward=2048)

# compute encoder output
output = encoder(X)

output.shape

torch.Size([2, 3, 512])

## Decoder

<img src='assets/1/decoder.PNG' /> 

In [22]:
class Decoder(nn.Module):
    def __init__(self, n_head=8, d_model=512, d_feedforward=2048):
        super(Decoder, self).__init__()
        
        # three sub-layers
        # 1. Masked Self-Attention
        # 2. Self-Attention
        # 3. Feedforward Neural Network
        
        self.masked_attention = MultiHeadAttention(n_head, d_model, mask=True)
        self.attention = MultiHeadAttention(n_head, d_model)
        self.feedforward = FeedForwardNetwork(d_model, d_feedforward)
        
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)
        self.layer_norm3 = nn.LayerNorm(d_model)
        
    def forward(self, x, encoder_output):
        # apply masked attention layer
        masked_attention_scores = self.masked_attention(x)
        # apply residual connection & layer normalization
        x = self.layer_norm1(masked_attention_scores + x)  
        
        # apply attention layer
        attention_scores = self.attention(x, encoder_output)
        x = self.layer_norm2(attention_scores + x)
        
        # apply feedforward network layer
        scores = self.feedforward(x)
        x = self.layer_norm3(scores + x)
        
        return x

In [23]:
# unit test on decoder

# define input for decoder (remember that decoder input is the expected output)
y = torch.rand(b, n, d_model)   
encoder_output = torch.rand(b, n, d_model)

# init decoder
decoder = Decoder(n_head=8, d_model=512, d_feedforward=2048)

# compute decoder output
output = decoder(y, encoder_output)

output.shape

torch.Size([2, 3, 512])

## Positional Encoding 

(WIP)

## Transformer Model

<img src='assets/1/transformer-model.PNG' width=80% height=80%/> 

In [33]:
# The encoders are all identical in structure (yet they do not share weights)

class Transformer(nn.Module):
    
    # n is the number of stacked layers of encoders & decoders
    def __init__(self, N=6, n_head=8, d_model=512, d_ff=2048):
        super(Transformer, self).__init__()
        
        # initialize the encoder and decoder stacks with sequential container
        encoders, decoders = [], []
        
        # stack encoders
        for i in range(N):
            encoder = Encoder(n_head, d_model, d_ff)
            encoders.append(encoder)
        
        # stack decoders
        for i in range(N):
            decoder = Decoder(n_head, d_model, d_ff)
            decoders.append(decoder)
            
        self.encoders = nn.Sequential(*encoders)
        self.decoders = nn.ModuleList(decoders) 
        
        # define linear layer
        self.W = nn.Parameter(torch.randn(n_head*d_key, d_model))
        
        # define softmax layer
        self.softmax = nn.Softmax(dim=-1)
        
    def forward(self, x, y):
        # embed and get positional
        
        # pass through encoders and get the output
        encoder_output = self.encoders(x)
        
        # pass through decoders
        decoder_output = y
        for decoder in self.decoders:
            decoder_output = decoder(decoder_output, encoder_output)
        
        # pass through linear layer
        scores = torch.matmul(decoder_output, self.W)
        
        # pass through softmax
        output = self.softmax(scores)
        
        return output

In [34]:
# unit test for transformer model

# -> we define decoders as ModuleList 
# since Sequential container does not accept multiple inputs for forward method

# from section 3.1 Encoder and Decoder Stacks
N = 6

# define input
X = torch.rand(b, n, d_model)

# define output
y = torch.rand(b, n, d_model)

# init transformer
transformer = Transformer(N=6, n_head=8, d_model=512, d_ff=2048)

# compute output
output = transformer(X, y)

output.shape

torch.Size([2, 3, 512])

##  Comparing with PyTorch Transformer Implementation

with nn.Transformers

**Transformers Success**

**Transformers Weakness**



### Quick Recap of Familiar Components

<img src='assets/1/components.PNG'/> 

### Final Notes

The implementation here is a very basic representation of how Transformers work. There are a lot of lacking components that are avoided here to preserve the simplicity of the code. There are also a lot of improvements that can be made in the implementation here, and calculations can be further optimized.

### Resources

1. [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) - Blog Post
2. [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf) - Paper
3. [Transformers for Beginners | What are they and how do they work](https://www.youtube.com/watch?v=_UVfwBqcnbM&t=4s) - Video
4. [Implementing a Transformer from Scratch](https://towardsdatascience.com/7-things-you-didnt-know-about-the-transformer-a70d93ced6b2) - Blog Post