# Transformers from Scratch

References: 
- [Transformer Anatomy](https://github.com/nlp-with-transformers/notebooks/blob/main/03_transformer-anatomy.ipynb)

## Attention Weights Visualization

In [23]:
import torch
import torch.nn.functional as F
from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModel
from bertviz import head_view 
from bertviz.transformers_neuron_view import BertModel 
from bertviz.neuron_view import show
from math import sqrt
from torch import nn

In [2]:
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)


## Transformer Architecture

### Simplified Self-Attention

In [16]:
# Tokenize text
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

In [7]:
# Create embedding layer
config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb  # [vocab_size, hidden_size]

Embedding(30522, 768)

At this point the token embeddings carry no context: each token ID maps to one fixed vector. Homonyms (words with the same spelling but different meanings), such as "flies" in the previous example, therefore get the same representation.
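
To make the lack of context concrete, here is a minimal sketch that embeds two sentences with a small `nn.Embedding` and compares the vectors for "flies". The toy vocabulary, the 4-dimensional embedding size, and the `embed` helper are assumptions made for illustration; they are not BERT's real vocabulary or dimensions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy vocabulary (not BERT's real one), for illustration only.
vocab = {"time": 0, "fruit": 1, "flies": 2, "like": 3,
         "an": 4, "a": 5, "arrow": 6, "banana": 7}
toy_emb = nn.Embedding(len(vocab), 4)  # [vocab_size, hidden_size]

def embed(sentence):
    """Look up one vector per whitespace-separated word."""
    ids = torch.tensor([vocab[word] for word in sentence.split()])
    return toy_emb(ids)  # [seq_len, hidden_size]

emb_a = embed("time flies like an arrow")
emb_b = embed("fruit flies like a banana")

# "flies" sits at position 1 in both sentences. The lookup only sees
# the token ID, so both sentences get the very same vector.
same_vector = torch.equal(emb_a[1], emb_b[1])
print(same_vector)
```

The lookup never sees the neighbouring tokens, which is why a later step such as self-attention is needed to mix in context.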

In [6]:
# Generate embeddings from token ids
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()  # [batch_size, seq_len, hidden_dim]

torch.Size([1, 5, 768])

In [11]:
# Create query, key and value, kept equal for simplicity
query = key = value = inputs_embeds  
dim_k = key.size(-1)
# Calculate the attention scores using the dot product as the similarity function:
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()

torch.Size([1, 5, 5])

In [13]:
# Normalization
weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

In [14]:
weights.size()

torch.Size([1, 5, 5])

In [15]:
# Calculate attention
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

torch.Size([1, 5, 768])

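This attention mechanism with equal query and key vectors assigns a very large score to identical words in the context, and in particular to the current word itself: the dot product of a vector with itself is its squared norm, which equals 1 for unit-length vectors and is the highest score that vector can reach against any key of the same length.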
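
The self-similarity effect can be checked numerically. The sketch below uses random stand-in vectors (`x`, an assumption, not real BERT embeddings) and compares the diagonal of the unscaled score matrix with the squared norms, then repeats the check after normalizing each vector to unit length.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 5, 8)  # stand-in for inputs_embeds: [batch_size, seq_len, hidden_dim]

# Unscaled scores with query == key; the diagonal holds each token's
# dot product with itself, i.e. its squared norm.
raw_self = torch.bmm(x, x.transpose(1, 2)).diagonal(dim1=1, dim2=2)
sq_norms = (x ** 2).sum(dim=-1)

# After scaling every vector to unit length, the diagonal is exactly 1
# (up to floating-point error) and no other entry in a row can exceed it.
x_unit = F.normalize(x, dim=-1)
unit_scores = torch.bmm(x_unit, x_unit.transpose(1, 2))
unit_self = unit_scores.diagonal(dim1=1, dim2=2)

print(torch.allclose(raw_self, sq_norms))
print(unit_self)
```

This is one motivation for giving each head its own learned query, key, and value projections, so that a token is not forced to attend mostly to itself.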

### Multi-head Attention

In [18]:
def scaled_dot_product_attention(query, key, value):
    """
    1. Calculate the attention scores from the query and key.
    2. Compute the attention weights by normalizing the scores with softmax.
    3. Update the token embeddings as the weighted sum of the values.
    """
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k) 
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

In [19]:
class AttentionHead(nn.Module):
    """
    Self-attention head.
    
    Args:
        embed_dim: embedding dimension.
        head_dim: number of dimensions we are projecting into.
        
    Notes:
        `head_dim` does not have to be smaller than the embedding
        dimension of the tokens (`embed_dim`), but in practice it is
        chosen so that `embed_dim` is a multiple of `head_dim`
        (`head_dim = embed_dim // num_heads`), which keeps the total
        computation constant regardless of the number of heads.
    """
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs  # [batch_size, seq_len, head_dim]
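
The arithmetic behind the choice of `head_dim` can be worked through in plain Python. This is a sketch under one stated assumption: `num_heads = 12`, the usual `num_attention_heads` for `bert-base-uncased`, which this notebook never prints. The `linear_params` helper is hypothetical and only counts weights and biases of an `nn.Linear` layer.

```python
embed_dim = 768   # hidden size, matching Embedding(30522, 768)
num_heads = 12    # assumption: config.num_attention_heads for bert-base-uncased
head_dim = embed_dim // num_heads

def linear_params(in_features, out_features):
    """Weights plus biases of a single nn.Linear(in_features, out_features)."""
    return in_features * out_features + out_features

# Each head owns three projections (q, k, v) of shape embed_dim -> head_dim.
per_head = 3 * linear_params(embed_dim, head_dim)
all_heads = num_heads * per_head

# One hypothetical full-width head: three projections embed_dim -> embed_dim.
single_wide_head = 3 * linear_params(embed_dim, embed_dim)

# Concatenating the heads restores the original hidden size.
concat_dim = num_heads * head_dim

print(head_dim, concat_dim)
print(all_heads, single_wide_head)
```

Under this assumption the twelve narrow heads use the same number of projection parameters as a single full-width head, and their concatenated output has the original hidden size.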

In [20]:
class MultiHeadAttention(nn.Module):
    """
    Multi-head attention.
    
    Args:
        config: multi-head attention configuration.
    """
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        """
        Concatenate the outputs of all attention heads and
        apply the final linear projection.
        """
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x  # [batch_size, seq_len, hidden_dim]     

In [21]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)    
attn_output.size() 

torch.Size([1, 5, 768])

In [24]:
model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])
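
The `sentence_b_start` expression counts the tokens whose segment ID is 0, which gives the index where the second sentence begins. The sketch below reproduces that step on hand-written inputs so it runs without downloading a model. The token list is an assumption: it presumes every word maps to a single WordPiece token and that the tokenizer adds `[CLS]` and `[SEP]` in the usual BERT layout.

```python
import torch

# Assumed tokenization of the sentence pair (one token per word).
tokens = ["[CLS]", "time", "flies", "like", "an", "arrow", "[SEP]",
          "fruit", "flies", "like", "a", "banana", "[SEP]"]

# Segment IDs: 0 for [CLS] + sentence A + first [SEP], 1 for the rest.
token_type_ids = torch.tensor([[0] * 7 + [1] * 6])

# Number of segment-0 tokens == index of the first token of sentence B.
sentence_b_start = (token_type_ids == 0).sum(dim=1)
first_b_token = tokens[sentence_b_start.item()]

print(sentence_b_start, first_b_token)
```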
