# Transformers Basics

References:
- [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [NLP with Transformers](https://www.amazon.com.au/Natural-Language-Processing-Transformers-Applications/dp/1098103246)
- [Transformer Anatomy](https://github.com/nlp-with-transformers/notebooks/blob/main/03_transformer-anatomy.ipynb)

<a href="https://colab.research.google.com/github/paulaceccon/deep-learning-studies/blob/main/notebooks/transformers/tranformer_basics.ipynb" target="_parent" style="float: left;"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Attention Weights Visualization

In [1]:
import sys
import torch
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoConfig, AutoModel, AutoTokenizer
from bertviz import head_view
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show
from math import sqrt
from torch import nn
from loguru import logger
from typing import Optional
import matplotlib.pyplot as plt

In [2]:
model_ckpt = "bert-base-uncased"
model = BertModel.from_pretrained(model_ckpt, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_ckpt, do_lower_case=True)

In [3]:
# This exposes the query vectors, key vectors, and other intermediate representations used to compute the attention weights.
# Each color band represents a single neuron value, where color intensity indicates the magnitude and hue the sign
# (blue=positive, orange=negative).
text = "The quick brown fox jumps over the lazy dog"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)

In [4]:
# Tokenize text
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,  3899]])

In [5]:
# Create embedding layer
config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb  # [vocab_size, hidden_size]

Embedding(30522, 768)

At this point the token embeddings carry no context: each token ID maps to the same vector wherever it appears. Homonyms (words with the same spelling but different meanings), such as "fox" (the animal, or the verb meaning to deceive), therefore share a single representation.
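A minimal sketch of this lookup behaviour, using a hypothetical toy vocabulary rather than the BERT one: the same ID at two positions returns the same vector, just as ID 1996 ("the") at positions 0 and 6 of the tokenized sentence would.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy vocabulary: 10 token IDs, 4-dimensional embeddings.
toy_emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Token ID 3 appears at positions 0 and 3, with different neighbours.
toy_ids = torch.tensor([[3, 7, 1, 3, 5]])
toy_vectors = toy_emb(toy_ids)  # [batch_size, seq_len, embedding_dim]

# The lookup depends only on the ID, so both occurrences get the same vector.
same_vector = torch.equal(toy_vectors[0, 0], toy_vectors[0, 3])
print(same_vector)  # True
```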

In [6]:
# Generate embeddings from token ids
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()  # [batch_size, seq_len, hidden_dim]

torch.Size([1, 9, 768])

In [7]:
# Create query, key and value, kept equal for simplicity
query = key = value = inputs_embeds
dim_q = query.size(-1)
# Calculate the attention scores using the dot product as the similarity function:
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_q)
print(query.size(), key.transpose(1,2).size(), scores.size())

torch.Size([1, 9, 768]) torch.Size([1, 768, 9]) torch.Size([1, 9, 9])
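The division by `sqrt(dim_q)` keeps the scores in a range where the softmax does not saturate. The sketch below is an illustration under the assumption of random standard-normal vectors (not the BERT embeddings): the spread of raw dot products grows roughly like the square root of the dimension, while the scaled ones stay close to 1. The helper `dot_product_std` is a hypothetical name introduced for this example.

```python
from typing import Tuple

import torch

torch.manual_seed(0)


def dot_product_std(dim: int, n_pairs: int = 10_000) -> Tuple[float, float]:
    """Std of raw and scaled dot products between random N(0, 1) vectors."""
    q = torch.randn(n_pairs, dim)
    k = torch.randn(n_pairs, dim)
    raw = (q * k).sum(dim=-1)   # one dot product per (q, k) pair
    scaled = raw / dim ** 0.5   # same scaling as in the attention scores
    return raw.std().item(), scaled.std().item()


for dim in (16, 256, 768):
    raw_std, scaled_std = dot_product_std(dim)
    print(f"dim={dim:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:.2f}")
```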


In [8]:
# Normalize the scores into attention weights with softmax (each row sums to 1)
weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

In [9]:
weights.size()

torch.Size([1, 9, 9])

In [10]:
# Calculate attention
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

torch.Size([1, 9, 768])

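With identical query and key vectors, this attention mechanism assigns a very large score to identical words in the context, and in particular to the current word itself: the dot product of a vector with itself is its squared norm, which is the largest value it can reach against any vector of the same length (it equals 1 only for unit-length vectors).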
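A small check of the self-score, assuming random standard-normal vectors as a hypothetical stand-in for `inputs_embeds`: the diagonal of the score matrix holds each token's squared norm divided by `sqrt(d)`, and every token ends up attending most strongly to itself.

```python
from math import sqrt

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for `inputs_embeds`: 9 random token vectors of size 768.
toy_x = torch.randn(1, 9, 768)
toy_dim = toy_x.size(-1)

toy_scores = torch.bmm(toy_x, toy_x.transpose(1, 2)) / sqrt(toy_dim)
toy_weights = F.softmax(toy_scores, dim=-1)

# Diagonal entries are the self-scores: ||x_i||^2 / sqrt(d).
self_scores = toy_scores.diagonal(dim1=1, dim2=2)
sq_norms = (toy_x ** 2).sum(dim=-1)
matches = torch.allclose(self_scores, sq_norms / sqrt(toy_dim), rtol=1e-4, atol=1e-3)
print(matches)

# Index of the largest weight in each row: every token picks itself.
print(toy_weights.argmax(dim=-1))
```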

The steps in cells 7 through 10 (scores, softmax, weighted sum) can be combined into a single function that implements scaled dot-product attention:

In [11]:
def scaled_dot_product_attention(
    query: torch.Tensor, 
    key: torch.Tensor, 
    value: torch.Tensor, 
    mask: Optional[torch.Tensor]=None
) -> torch.Tensor:
    """
    Compute scaled dot-product attention.

    Args:
        query: Tensor with shape [batch_size, seq_length_q, depth].
        key: Tensor with shape [batch_size, seq_length_k, depth].
        value: Tensor with shape [batch_size, seq_length_k, depth_v].
        mask: Optional tensor with shape [batch_size, seq_length_q, seq_length_k].
            Positions where the mask is 0 are excluded from attention. Default is None.

    Returns:
        Tensor with shape [batch_size, seq_length_q, depth_v].
    """
    # Scale by the key dimension (equal to the query dimension).
    dim_k = key.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)
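The `mask` argument can be exercised with a causal (lower-triangular) mask. The sketch below uses a compact copy of the function under the hypothetical name `attention_sketch` so that it runs on its own. Passing the identity matrix as `value` is a trick for this example only: it makes the returned output equal to the attention weights, so the effect of the mask is directly visible.

```python
from math import sqrt

import torch
import torch.nn.functional as F


def attention_sketch(query, key, value, mask=None):
    # Compact copy of the attention function so this snippet is self-contained.
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(key.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.bmm(F.softmax(scores, dim=-1), value)


torch.manual_seed(0)
seq_len, dim = 4, 8
q = torch.randn(1, seq_len, dim)
k = torch.randn(1, seq_len, dim)

# Causal mask: 1 where attention is allowed, 0 where it is blocked.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

# With the identity as `value`, the output equals the attention weights.
v = torch.eye(seq_len).unsqueeze(0)
causal_weights = attention_sketch(q, k, v, mask=causal_mask)

# Entries above the diagonal are zero; each row still sums to 1.
print(causal_weights[0])
```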

In [12]:
class AttentionHead(nn.Module):
    """
    Self-attention head.
    
    Args:
        embed_dim: embedding dimension.
        head_dim: number of dimensions we are projecting into.
        mask: optional attention mask passed to the attention function. Default is None.
        
    Notes:
        Although `head_dim` does not have to be smaller 
        than the number of embedding dimensions of the tokens (`embed_dim`), 
        in practice `embed_dim` is chosen to be a multiple of `head_dim` 
        (head_dim = embed_dim // num_heads) so that the total computation 
        across all heads stays constant regardless of the number of heads.
    """
    def __init__(self, embed_dim: int, head_dim: int, mask: Optional[torch.Tensor] = None) -> None:
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)
        self.mask = mask

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        """
        Perform a forward pass of the attention head.

        Args:
            hidden_state: Input tensor of shape [batch_size, seq_len, embed_dim].

        Returns:
            Tensor of shape [batch_size, seq_len, head_dim], representing the attention outputs.
        """
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state), mask=self.mask)
        return attn_outputs  

In [13]:
class MultiHeadAttention(nn.Module):
    """
    Multi-head attention module.

    Args:
        config: Configuration for the multi-head attention.
        mask: Optional mask tensor. Default is None.
    """
    def __init__(self, config, mask: Optional[torch.Tensor] = None) -> None:
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        logger.debug(f"hidden_dim: {embed_dim}")
        logger.debug(f"num_heads: {num_heads}")
        
        assert embed_dim % num_heads == 0
        head_dim = embed_dim // num_heads
        logger.debug(f"head_dim: {head_dim}")
        
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim, mask) for _ in range(num_heads)]
        )
        logger.debug(f"Attention heads: {self.heads}")
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        """
        Perform a forward pass of the multi-head attention.

        Args:
            hidden_state: Input tensor of shape [batch_size, seq_len, embed_dim].

        Returns:
            Tensor of shape [batch_size, seq_len, embed_dim], 
            representing the output of the multi-head attention.
        """
        # Run each head once and reuse the outputs for logging and concatenation.
        head_outputs = [h(hidden_state) for h in self.heads]
        logger.debug(f"head_size: {head_outputs[0].size()}")
        # [batch_size, seq_len, (num_heads * head_dim) = hidden_dim]
        x = torch.cat(head_outputs, dim=-1)
        x = self.output_linear(x)
        logger.debug(x.size())
        return x
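A rough parameter count shows why `head_dim = embed_dim // num_heads` keeps the cost independent of the number of heads. The sketch assumes the `bert-base-uncased` sizes (hidden size 768, 12 heads); `count_params` is a hypothetical helper introduced for this example.

```python
from torch import nn

# Sizes assumed from `bert-base-uncased`: hidden size 768, 12 attention heads.
embed_dim, num_heads = 768, 12
head_dim = embed_dim // num_heads  # 64


def count_params(module: nn.Module) -> int:
    """Total number of weights and biases in a module."""
    return sum(p.numel() for p in module.parameters())


# One head holds three Linear(embed_dim, head_dim) projections (q, k, v).
per_head = 3 * count_params(nn.Linear(embed_dim, head_dim))
output_proj = count_params(nn.Linear(embed_dim, embed_dim))
total = num_heads * per_head + output_proj

# Same count as four full-width projections (q, k, v, output).
full_width = 4 * count_params(nn.Linear(embed_dim, embed_dim))

print(per_head, total, total == full_width)  # 147648 2362368 True
```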

In [14]:
logger.debug(f"input_size: {inputs_embeds.size()}")
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)    

2023-07-30 17:03:54.294 | DEBUG    | __main__:<module>:1 - input_size: torch.Size([1, 9, 768])
2023-07-30 17:03:54.295 | DEBUG    | __main__:__init__:13 - hidden_dim: 768
2023-07-30 17:03:54.295 | DEBUG    | __main__:__init__:14 - num_heads: 12
2023-07-30 17:03:54.295 | DEBUG    | __main__:__init__:18 - head_dim: 64
2023-07-30 17:03:54.302 | DEBUG    | __main__:__init__:23 - Attention heads: ModuleList(
  (0-11): 12 x AttentionHead(
    (q): Linear(in_features=768, out_features=64, bias=True)
    (k): Linear(in_features=768, out_features=64, bias=True)
    (v): Linear(in_features=768, out_features=64, bias=True)
  )
)
2023-07-30 17:03:54.305 | DEBUG    |

In [15]:
model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

sentence_a = "The quick brown fox jumps over the lazy dog"
sentence_b = "How quickly daft jumping zebras vex!"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
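The `sentence_b_start` line in the cell above relies on BERT's segment IDs: tokens of the first sentence (including `[CLS]` and the first `[SEP]`) have `token_type_ids == 0`, so counting the zeros gives the index where the second sentence begins. A minimal sketch with hand-made segment IDs; the length of segment B here is made up for illustration:

```python
import torch

# Hypothetical segment IDs for a pair laid out as [CLS] A ... [SEP] B ... [SEP].
# 11 zeros = [CLS] + 9 tokens of sentence A + [SEP]; the 8 ones are made up.
toy_token_type_ids = torch.tensor([[0] * 11 + [1] * 8])

# Counting the zeros gives the index of the first token of sentence B.
toy_sentence_b_start = (toy_token_type_ids == 0).sum(dim=1)
print(toy_sentence_b_start)  # tensor([11])
```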

