In [None]:
!pip install transformers
!pip install bertviz

This chapter dives deep into the Transformer encoder and decoder architectures. It starts by explaining the encoder with a simple implementation of attention. Multi-head attention comes next, followed by positional embeddings, layer normalization and skip connections. Combining all of these, we obtain a Transformer encoder. Then the differences between the encoder and the decoder are given: masking and attention over the encoder's output vectors. The chapter concludes with the state-of-the-art models and a classification of the architectures. 


* Dense vs sparse 
* Positional embedding: absolute vs learnable, relative 

# Transformer Architecture 

Encoder-decoder models take a sequence of words and generate another sequence of words for a task such as machine translation or summarization. 

The encoder converts the input sequence into a sequence of embeddings, called the hidden states. The decoder takes the hidden states from the encoder and generates the output sequence. Both have a layered architecture. 

The models can be divided into three groups depending on their architecture:
* Encoder-only: These models convert the given input sequence into representations. They are mostly used for text classification and NER. Examples are BERT-like models.
* Decoder-only: These models generate a sequence of words when given an incomplete start. Examples are GPT-like models. 
* Encoder-decoder: These models are suitable for machine translation and summarization. Examples are the BART and T5 models. 



## Encoder 

Each encoder block consists of a multi-head self-attention layer followed by a feed-forward network. The output of each encoder block has the same size as its input. 

* The role of the encoder block is to update the input embeddings to produce representations that encode contextual information in the sequence. 

### Self-attention 

The "self" part refers to the fact that the weights are computed for all hidden states in the same set, for example all the hidden states of the encoder. 

The idea is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of the embeddings. 

**Formulation:** For token embeddings $x_1,...,x_n$, the new embeddings are $x_{i}'= \sum_{j=1}^n w_{ji}x_j$, where the $w_{ji}$ are called the attention weights and are normalized so that $\sum_{j}w_{ji}=1$. 

These new embeddings are called *contextualized embeddings*.  
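
As a small illustration of the formula above, the sketch below builds contextualized embeddings for three toy 2-dimensional embeddings. The tensors `x` and `w` hold made-up values chosen only for this example; `w[i, j]` stores the weight $w_{ji}$, so each row of `w` sums to 1.

```python
import torch

# Three toy token embeddings of dimension 2 (made-up values)
x = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

# w[i, j] is the weight w_ji that token j contributes to the new embedding of token i
w = torch.tensor([[0.50, 0.50, 0.00],
                  [0.00, 1.00, 0.00],
                  [0.25, 0.25, 0.50]])

# Each new embedding is a weighted average of all the original embeddings
x_new = w @ x
print(x_new)
```

The second token puts all of its weight on itself, so its embedding is unchanged, while the other two tokens mix in information from their neighbours.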

**Scaled dot-product attention**

The steps to compute the attention:
1. Project each token embedding into three vectors called query, key and value. 
2. Compute the attention scores between the query and key vectors. The dot product of two vectors measures how similar they are. For a sequence with $n$ tokens, the attention scores form an $n \times n$ matrix. 

3. Compute the attention weights by first multiplying the attention scores by a scaling factor, then normalizing with a softmax function. The resulting matrix ($n \times n$) contains the attention weights $w_{ji}$. 

4. Multiply the value vectors by the weights to obtain the updated token embeddings:  $x_{i}' = \sum_{j}w_{ji}v_j$.
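
In matrix form, these four steps are usually written as $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$, where the rows of $Q$, $K$ and $V$ are the query, key and value vectors and $d_k$ is their dimension. The sketch below illustrates one common motivation for the $1/\sqrt{d_k}$ scaling factor. It assumes the vector components are independent with zero mean and unit variance; the sample count and the seed are arbitrary choices for this example.

```python
import torch
from math import sqrt

torch.manual_seed(0)
dim_k = 768          # same size as the BERT hidden dimension
num_samples = 4096   # arbitrary number of random query/key pairs

q = torch.randn(num_samples, dim_k)
k = torch.randn(num_samples, dim_k)

# One dot product per query/key pair
raw_scores = (q * k).sum(dim=-1)
scaled_scores = raw_scores / sqrt(dim_k)

raw_std = raw_scores.std().item()
scaled_std = scaled_scores.std().item()
print(f"std of raw dot products:    {raw_std:.2f}")
print(f"std of scaled dot products: {scaled_std:.2f}")
```

Under these assumptions the raw scores spread out roughly like $\sqrt{d_k}$, which would push the softmax towards a near one-hot output; the scaled scores stay close to unit spread.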


In [2]:
# Visualize the attentions 
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)


In [4]:
# First tokenize the text
# [CLS] and [SEP] tokens are excluded with add_special_tokens=False 
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
print(inputs.input_ids)
print(text)

tensor([[ 2051, 10029,  2066,  2019,  8612]])
time flies like an arrow


In [5]:
# Create a dense embedding matrix for the model's vocabulary
from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

In [7]:
# Get the embeddings for the inputs 
# The resulting tensor [batch_size, seq_len, hidden_dim]
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 5, 768])

In [8]:
# For now, we postpone the positional embeddings 
# Create key, value and query matrices 
import torch
from math import sqrt

query = key = value = inputs_embeds
dim_k = key.size(-1)
print("Dimension is ", dim_k)

Dimension is  768


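`torch.bmm()` performs a batch matrix-matrix product on 3-D tensors: inputs of shape `[batch_size, n, m]` and `[batch_size, m, p]` give an output of shape `[batch_size, n, p]`. Here the query has shape `[batch_size, seq_len, hidden_dim]` and the transposed key has shape `[batch_size, hidden_dim, seq_len]`. 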
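
To make the shapes concrete, here is a minimal sketch with toy sizes (a batch of 2, 3 tokens, 4 hidden dimensions, all chosen arbitrarily). It also checks that `torch.bmm()` gives the same result as multiplying the matrices of each batch element one at a time.

```python
import torch

torch.manual_seed(0)
a = torch.randn(2, 3, 4)     # [batch_size, seq_len, hidden_dim]
b = a.transpose(1, 2)        # [batch_size, hidden_dim, seq_len]

# One matrix product per batch element
out = torch.bmm(a, b)
print(out.shape)

# The same computation written as an explicit loop over the batch
manual = torch.stack([a[i] @ b[i] for i in range(a.size(0))])
print(torch.allclose(out, manual, atol=1e-5))
```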

In [9]:
print("Calculating the attention score with a dot-product between key and query")
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()

Calculating the attention score with a dot-product between key and query


torch.Size([1, 5, 5])

In [11]:
# Now apply the softmax over the scaled dot-product 
import torch.nn.functional as F

weights = F.softmax(scores, dim=-1)
print("Sum of the weights " , weights.sum(dim=-1))
print("Shape of the weights ", weights.shape)

# Calculate the weighted vectors 
attn_outputs = torch.bmm(weights, value)
print("Shape of weighted vectors ", attn_outputs.shape)


Sum of the weights  tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)
Shape of the weights  torch.Size([1, 5, 5])
Shape of weighted vectors  torch.Size([1, 5, 768])


In summary, self-attention consists of two matrix multiplications and a softmax function. It is simply a form of weighted averaging. 


In [12]:
# Forming functions to be used later 
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

**Multi-head Attention**

In practice, the self-attention layer applies three independent linear transformations to each embedding to generate the query, key and value vectors. Each set of these three projections is called an attention head, and a layer with several such sets is a multi-head attention layer. Having several heads allows the model to focus on several aspects of the sequence at once. 


In [16]:
# A single attention head 
# head_dim: number of dimensions we are projecting into 
# For example, BERT has 12 attention heads, 
#      so head_dim in each is 768/12 = 64
class AttentionHead(nn.Module):
  def __init__(self, embed_dim, head_dim):
    super().__init__() 
    self.q = nn.Linear(embed_dim, head_dim) 
    self.k = nn.Linear(embed_dim, head_dim)
    self.v = nn.Linear(embed_dim, head_dim)
  def forward(self, hidden_state):
    attn_outputs = scaled_dot_product_attention(self.q(hidden_state), 
                                                self.k(hidden_state),
                                                self.v(hidden_state))
    return attn_outputs

In [17]:
# Create Multi-Head attention layer 
class MultiHeadAttention(nn.Module):
  def __init__(self,config):
    super().__init__()
    embed_dim = config.hidden_size 
    num_heads = config.num_attention_heads 
    head_dim = embed_dim // num_heads 
    self.heads = nn.ModuleList(
        [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
    )
    self.output_linear = nn.Linear(embed_dim,embed_dim)
  def forward(self, hidden_state):
    x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
    x = self.output_linear(x)
    return x

In [22]:
# What does multi-head layer generate? 
# config is downloaded for the BERT model 
multihead_attn = MultiHeadAttention(config)
# inputs_embeds are the vectors of shape [batch_size, seq_len, embed_dim]
attn_output = multihead_attn(inputs_embeds)
# the output is also the same shape 
attn_output.size()

torch.Size([1, 5, 768])

In [23]:
# Visualize attention from a pre-trained model 
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])



### Feed-Forward Layer 

This sublayer is a two-layer fully connected neural network, but instead of processing the whole sequence of embeddings as a single vector, it processes each embedding independently. It is also called a *position-wise feed-forward layer*. 

This is the sublayer that is most often enlarged when a model is scaled up. 
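
The "position-wise" behaviour can be checked with a small sketch. The layer sizes below are toy values chosen for this example, and `ffn` is a stand-in built from standard `torch.nn` modules: applying it to the whole sequence gives the same result as applying it to each position separately.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size, intermediate_size = 8, 32   # toy sizes, not the BERT values

# Stand-in two-layer feed-forward network
ffn = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),
    nn.GELU(),
    nn.Linear(intermediate_size, hidden_size),
)

x = torch.randn(1, 5, hidden_size)   # [batch_size, seq_len, hidden_size]

# Apply the network to the full sequence at once
full = ffn(x)
# Apply the network to one position at a time and stack the results
per_position = torch.stack([ffn(x[:, i]) for i in range(x.size(1))], dim=1)

print(torch.allclose(full, per_position, atol=1e-5))
```

Because no information flows between positions here, mixing across tokens is left entirely to the attention layer.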

In [28]:
class FeedForward(nn.Module):
  def __init__(self,config): 
    super().__init__()
    self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
    self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
    self.gelu = nn.GELU()
    self.dropout = nn.Dropout(config.hidden_dropout_prob)
  def forward(self,x): 
    x = self.linear_1(x)
    x = self.gelu(x)
    x = self.linear_2(x)
    x = self.dropout(x)
    return x 

In [29]:
# Testing the feed forward function 
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()

torch.Size([1, 5, 768])

### Adding Layer Normalization

Layer normalization normalizes each input in the batch to have zero mean and unit variance. Here, each token vector is normalized over its hidden dimension. 
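
As a quick check of what this means, the sketch below normalizes a random toy tensor by hand over its last dimension and compares the result with `nn.LayerNorm`. The tensor sizes are arbitrary; at initialization `nn.LayerNorm` has a scale of 1 and a shift of 0, so the two results should agree.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(1, 5, 8)        # [batch_size, seq_len, hidden_size], toy sizes

layer_norm = nn.LayerNorm(8)    # freshly initialized: weight = 1, bias = 0

# Manual normalization over the hidden dimension (biased variance, as in LayerNorm)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(layer_norm(x), manual, atol=1e-5))
print(manual.mean(dim=-1))      # close to zero for every token
```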

Skip connections pass a tensor to the next layer of the model without processing it and add it to the processed tensor. 

1. Post layer normalization: In the original Transformer paper, the layer normalization is placed between the skip connections. This arrangement can be hard to train from scratch because the gradients can diverge, which is why it is usually combined with learning rate warm-up. 

2. Pre layer normalization: places the normalization within the span of the skip connections. This tends to be much more stable during training. 
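
For contrast with the pre layer normalization arrangement, here is a minimal sketch of a post layer normalization block. The class name `PostLNEncoderLayer` and its default sizes are hypothetical, and PyTorch's built-in `nn.MultiheadAttention` is used as a stand-in for the attention sublayer so that the block is self-contained.

```python
import torch
from torch import nn

class PostLNEncoderLayer(nn.Module):
    """Hypothetical encoder block with layer normalization AFTER each skip connection."""
    def __init__(self, hidden_size=768, num_heads=12, intermediate_size=3072):
        super().__init__()
        # Built-in attention used as a stand-in sublayer
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        self.layer_norm_1 = nn.LayerNorm(hidden_size)
        self.layer_norm_2 = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Skip connection first, then normalize the sum
        attn_out, _ = self.attention(x, x, x)
        x = self.layer_norm_1(x + attn_out)
        x = self.layer_norm_2(x + self.feed_forward(x))
        return x

# Usage with toy sizes
torch.manual_seed(0)
layer = PostLNEncoderLayer(hidden_size=16, num_heads=4, intermediate_size=32)
x = torch.randn(1, 5, 16)
out = layer(x)
print(out.shape)
```

In this arrangement the output of every block passes through a normalization, whereas with pre layer normalization the skip path carries the unnormalized tensor through the whole stack.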


In [32]:
# Final encoder block 
class TransformerEncoderLayer(nn.Module):
  def __init__(self,config): 
    super().__init__()
    self.layer_norm_1 = nn.LayerNorm(config.hidden_size) 
    self.layer_norm_2 = nn.LayerNorm(config.hidden_size) 
    self.attention = MultiHeadAttention(config)
    self.feed_forward = FeedForward(config) 
  def forward(self, x):
    # apply layer normalization 
    hidden_state = self.layer_norm_1(x)
    # apply attention with a skip connection 
    x = x + self.attention(hidden_state) 
    # apply feed-forward with a skip connection 
    x = x + self.feed_forward(self.layer_norm_2(x))
    return x 

In [33]:
# Test it with the embeddings 
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))

In [48]:
# What happens in the layer normalization?
normalized_embeds = encoder_layer.layer_norm_1(inputs_embeds)
torch.sum(normalized_embeds,2)

tensor([[-3.8147e-06,  2.8610e-06, -9.5367e-07,  1.1444e-05,  1.9073e-06]],
       grad_fn=<SumBackward1>)

In [49]:
# How does skip connection work? 
inputs_embeds = inputs_embeds + encoder_layer.attention(normalized_embeds)

In [51]:
inputs_embeds.shape

torch.Size([1, 5, 768])

### Positional Embeddings 

Until now, the implemented model works, but it doesn't know about the positions of the tokens. The way to incorporate this information into the model is to use positional embeddings. The idea is to augment the token embeddings with a position-dependent pattern of values arranged in a vector. 

One of the most popular ways is to use a learnable pattern, which works the same way as the token embeddings. 
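
A fixed (non-learnable) alternative is the sinusoidal pattern of the original Transformer paper, where even dimensions use $\sin(pos / 10000^{2i/d})$ and odd dimensions use $\cos(pos / 10000^{2i/d})$. The helper below is a minimal sketch of that pattern; the function name `sinusoidal_position_embeddings` is hypothetical and the sketch assumes an even `hidden_size`.

```python
import math
import torch

def sinusoidal_position_embeddings(seq_len, hidden_size):
    """Hypothetical helper: fixed sine/cosine position pattern (hidden_size must be even)."""
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # [seq_len, 1]
    # 1 / 10000^(2i / hidden_size) for each pair of dimensions
    div_term = torch.exp(
        torch.arange(0, hidden_size, 2, dtype=torch.float)
        * (-math.log(10000.0) / hidden_size)
    )
    pe = torch.zeros(seq_len, hidden_size)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

pe = sinusoidal_position_embeddings(seq_len=5, hidden_size=8)
print(pe.shape)
```

Such a pattern has no trainable parameters and would simply be added to the token embeddings, in the same place where a learnable position embedding is added.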

Now, we create a custom Embeddings module that combines a token embedding layer, which projects the input tokens, with a positional embedding layer, which projects the position ids. 

In [59]:
# Creating an embedding 
class Embeddings(nn.Module):
  def __init__(self, config): 
    super().__init__()
    self.token_embeddings = nn.Embedding(config.vocab_size,
                                         config.hidden_size)
    self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                            config.hidden_size)
    self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
    self.dropout = nn.Dropout(config.hidden_dropout_prob)
  def forward(self, input_ids):
    # create position ids for input sequence 
    seq_length = input_ids.size(1)
    position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
    # create token and position embeddings 
    token_embeddings = self.token_embeddings(input_ids)
    position_embeddings = self.position_embeddings(position_ids)
    # Combine token and position embeddings 
    embeddings = token_embeddings + position_embeddings  
    embeddings = self.layer_norm(embeddings)
    embeddings = self.dropout(embeddings)
    return embeddings 


In [60]:
# Testing the embedding layer 
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

torch.Size([1, 5, 768])

Finally, we combine the embeddings and the stacked encoder layers into a full Transformer encoder: 

In [63]:
class TransformerEncoder(nn.Module):
  def __init__(self, config): 
    super().__init__()
    self.embeddings = Embeddings(config)
    self.layers = nn.ModuleList([TransformerEncoderLayer(config) for _ in range(config.num_hidden_layers)])
  def forward(self,x):
    x = self.embeddings(x)
    for layer in self.layers:
      x = layer(x)
    return x 

In [64]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

torch.Size([1, 5, 768])

### Adding a classification Head 

Until now, we have an encoder that generates a hidden state for each input token. This is the body of the Transformer. To use the model for prediction, we need to add a task-specific head such as a classification head. 

In [66]:
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of the first token ([CLS] when special tokens are added)
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [67]:
# Defining the number of classes in the task  
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

torch.Size([1, 3])

For each input in the batch (1 in this example), we get unnormalized logits, one for each class (3 in the example). 
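
To turn such logits into a prediction, one would typically apply a softmax and take the most probable class. The sketch below uses made-up logits rather than the output of the untrained model above.

```python
import torch
import torch.nn.functional as F

# Made-up logits of shape [batch_size, num_labels]
logits = torch.tensor([[2.0, 0.5, -1.0]])

# Softmax turns the logits into probabilities that sum to 1 per input
probs = F.softmax(logits, dim=-1)
# The predicted class is the index of the largest probability
predicted_class = probs.argmax(dim=-1)

print(probs)
print(predicted_class)
```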

## Decoder 

The main differences between decoder and encoder are: 

* Masked multi-head self-attention layer: It ensures that the token generated at each step depends only on the past outputs and the current token, by masking out the future positions.  

* Encoder-decoder attention layer: It is a multi-head attention over the output key and value vectors of the encoder stack with the intermediate representations of the decoder acting as the queries. 
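
The encoder-decoder attention in the second bullet can be sketched as follows. For brevity this sketch leaves out the learned query, key and value projections and uses the raw states directly; the function name `cross_attention` and the toy shapes are assumptions made for this example.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def cross_attention(decoder_states, encoder_states):
    """Hypothetical sketch: queries from the decoder, keys and values from the encoder."""
    dim_k = decoder_states.size(-1)
    # [batch, tgt_len, dim] x [batch, dim, src_len] -> [batch, tgt_len, src_len]
    scores = torch.bmm(decoder_states, encoder_states.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    # Each decoder position receives a weighted average of the encoder states
    return torch.bmm(weights, encoder_states), weights

torch.manual_seed(0)
encoder_states = torch.randn(1, 5, 8)   # 5 source tokens
decoder_states = torch.randn(1, 3, 8)   # 3 target tokens generated so far

out, weights = cross_attention(decoder_states, encoder_states)
print(out.shape, weights.shape)
```

The attention matrix is no longer square: it has one row per decoder position and one column per encoder position.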

In [68]:
# creating a mask
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
mask[0]

tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

In [69]:
scores.masked_fill(mask == 0, -float("inf"))

tensor([[[28.8601,    -inf,    -inf,    -inf,    -inf],
         [ 0.1249, 25.3669,    -inf,    -inf,    -inf],
         [-1.4536,  0.7622, 26.2898,    -inf,    -inf],
         [-1.2275, -0.7349,  0.2297, 28.1567,    -inf],
         [ 0.8334, -0.5028,  0.3620,  0.3227, 27.6891]]],
       grad_fn=<MaskedFillBackward0>)

In [70]:
# New dot product with mask variable
def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)
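
To check the effect of the mask on the attention weights, the sketch below applies the same `masked_fill` and softmax steps to random stand-in scores (the scores and the seed are arbitrary). Positions filled with `-inf` receive a weight of exactly zero after the softmax, so each token only attends to itself and to earlier tokens.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 5
scores = torch.randn(1, seq_len, seq_len)   # stand-in attention scores

# Lower-triangular mask: 1 where attention is allowed, 0 for future positions
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights[0])
```

Each row still sums to 1, and the first token can only attend to itself.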