<img style="float: right;" src="../../assets/htwlogo.svg">

# Exercise: Studying Attention Layers

**Author**: _Erik Rodner_ <br>

In this exercise, we analyze scaled dot-product attention, the core operation of a transformer layer, step by step.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F
from transformers import BertTokenizer

## Tokenization

Let's first tokenize some text. The tokens serve no deeper purpose here: they only fix the sequence length and label the axes of the final plot :)

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenization and input preparation
sentence = "Transformers are powerful models for natural language processing."
tokens = tokenizer.tokenize(sentence)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_tensor = torch.tensor([input_ids])

print(f"Sentence: '{sentence}'")
print(f"Tokens: {tokens}")
print(f"Input IDs: {input_ids}")

## Generate synthetic embedding data 

For simplicity, we'll use random values with a rather low dimension here.
In a real setting, the embeddings may also be initialized randomly, but they are then tuned during training.

In [None]:
embedding_dim = 8
# Note: this draws an independent random vector per position, so repeated tokens
# get different vectors; a real embedding table maps each token ID to one shared vector.
data = torch.rand((len(input_ids), embedding_dim))
print(f"\nGenerated Embedding Shape: {data.shape}")
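
The simplification noted in the comment above can be made concrete with a lookup table. The following is a minimal sketch, assuming a toy vocabulary size and made-up token IDs (both hypothetical, chosen only for illustration); it uses `torch.nn.Embedding` to show that repeated token IDs share one vector.

```python
import torch

torch.manual_seed(0)  # reproducible random initialization

vocab_size = 100      # hypothetical toy vocabulary size
embedding_dim = 8     # same low dimension as in this notebook

# A trainable lookup table: one row (vector) per token ID.
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

# Hypothetical token IDs; ID 7 appears twice on purpose.
toy_ids = torch.tensor([7, 42, 7, 3])
toy_data = embedding(toy_ids)  # shape: (4, embedding_dim)

print(toy_data.shape)
# Both occurrences of ID 7 are looked up from the same row of the table.
print(torch.equal(toy_data[0], toy_data[2]))
```

The weights of such a table start out random and would be updated by the optimizer during training, which matches the "real setting" described above.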

## Transformer Layer in Action: Scaled Dot Product Attention

Let's first generate queries, keys, and values.
Our $Q$, $K$, $V$ matrices are computed by multiplying the embedding matrix with randomly initialized query, key, and value weight matrices.
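
In matrix form, this can be written as follows. The symbols are assumptions introduced here for illustration: $X \in \mathbb{R}^{n \times d}$ stands for the embedding matrix (`data`, with $n$ tokens and $d$ = `embedding_dim`), and $W_Q$, $W_K$, $W_V$ stand for `query_weights`, `key_weights`, and `value_weights`.

$$
Q = X W_Q \in \mathbb{R}^{n \times d_k}, \qquad
K = X W_K \in \mathbb{R}^{n \times d_k}, \qquad
V = X W_V \in \mathbb{R}^{n \times d_v}
$$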

In [None]:
dk = 4 # dimension of the query and key vectors
dv = 4 # dimension of the value vectors
query_weights = torch.rand((embedding_dim, dk))
key_weights = torch.rand((embedding_dim, dk))
value_weights = torch.rand((embedding_dim, dv))

Q = torch.matmul(data, query_weights)
K = torch.matmul(data, key_weights)
V = torch.matmul(data, value_weights)

print(f"Query (Q) Shape: {Q.shape}\n", Q)
print(f"Key (K) Shape: {K.shape}\n", K)
print(f"Value (V) Shape: {V.shape}\n", V)

## Scaled dot-product attention

Let's apply scaled dot-product attention step-by-step.
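
For reference, scaled dot-product attention is usually written as shown here, where $d_k$ corresponds to `dk` above and the softmax is applied to each row separately:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$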

**Exercise 1**: Complete the following function to compute the attention scores, i.e. the softmax-normalized scaled dot products between queries and keys.

In [None]:
def compute_attention_scores(Q, K):
    dk = Q.size(-1)  # query/key dimension, needed for the scaling
    scores = 0 # YOUR CODE HERE: compute the scaled dot product between Q and K properly :)
    attn_probs = F.softmax(scores, dim=-1)  # normalize each row to sum to 1
    return attn_probs

attention_scores = compute_attention_scores(Q, K)
print(f"Attention Scores Shape: {attention_scores.shape}\n", attention_scores)
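
The "scaled" part of the name is easier to appreciate with a small experiment. The sketch below assumes standard-normal query and key vectors (an illustrative assumption; the notebook itself uses uniform random values) and compares the spread of raw dot products with the spread after dividing by $\sqrt{d_k}$. The helper `dot_product_std` is hypothetical and not part of the exercise.

```python
import torch

torch.manual_seed(0)

def dot_product_std(dk, n_samples=10_000):
    """Empirical std of q.k before and after scaling, for standard-normal q and k."""
    q = torch.randn(n_samples, dk)
    k = torch.randn(n_samples, dk)
    dots = (q * k).sum(dim=-1)          # one dot product per sample pair
    scaled = dots / dk ** 0.5           # divide by sqrt(dk)
    return dots.std().item(), scaled.std().item()

for dk in (4, 64, 256):
    raw, scaled = dot_product_std(dk)
    print(f"dk={dk:4d}  raw std={raw:6.2f}  scaled std={scaled:5.2f}")
```

Under this assumption the raw standard deviation grows roughly like $\sqrt{d_k}$, while the scaled one stays close to 1. Large unscaled values would push the softmax towards a near one-hot output, which is the usual motivation given for the scaling.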

**Exercise 2**: Now complete the following function to compute the final embeddings.

In [None]:
def compute_weighted_values(attention_scores, V):
    return 0 # YOUR CODE HERE: compute the weighted values properly :)

weighted_values = compute_weighted_values(attention_scores, V)
print(f"Weighted Values Shape: {weighted_values.shape}\n", weighted_values)
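
To check a vectorized solution, it can help to compare it against a deliberately slow reference on inputs with known results. The following is a sketch only: `weighted_values_loop` and `check_weighted_values` are hypothetical helpers, not part of the original exercise, and the loop spells out one possible reading of "weighted values" (output row $i$ is the sum of the value vectors weighted by row $i$ of the attention matrix).

```python
import torch

def weighted_values_loop(attn, V):
    """Slow reference: output row i is the attn[i]-weighted sum of the rows of V."""
    n, dv = attn.size(0), V.size(-1)
    out = torch.zeros(n, dv)
    for i in range(n):          # one output row per query position
        for j in range(n):      # accumulate over all key/value positions
            out[i] += attn[i, j] * V[j]
    return out

def check_weighted_values(fn, n=5, dv=4):
    """Hypothetical helper: test fn on attention matrices with known results."""
    V = torch.rand(n, dv)
    # Identity attention: every token attends only to itself, so the output equals V.
    assert torch.allclose(fn(torch.eye(n), V), V)
    # Uniform attention: every output row is the mean of all value vectors.
    uniform = torch.full((n, n), 1.0 / n)
    expected = V.mean(dim=0).expand(n, dv)
    assert torch.allclose(fn(uniform, V), expected, atol=1e-6)
    return True

print(check_weighted_values(weighted_values_loop))
```

Passing your own `compute_weighted_values` to `check_weighted_values` should succeed as well if it implements the same operation.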

## Visualization of the attention scores

Let's visualize the attention scores. Since all weights are random, the values carry no meaning, but the plot shows their structure: one row per query token and one column per key token.

In [None]:
# Visualization of Attention Weights
fig, ax = plt.subplots(figsize=(10, 6))
cax = ax.matshow(attention_scores.detach().numpy(), cmap='viridis')
ax.set_title("Attention Scores Heatmap")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=90)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
fig.colorbar(cax)
plt.show()
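
When reading such a heatmap, it helps to remember that the softmax with `dim=-1` normalizes along the last axis, so each row is a probability distribution while the columns are not. A minimal sketch, using random logits as a stand-in for the unnormalized scores (an assumption, not the notebook's actual query-key product):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the unnormalized scores: random logits for 5 tokens.
logits = torch.rand(5, 5)
probs = F.softmax(logits, dim=-1)   # normalize along the last axis

print(probs.sum(dim=-1))  # row sums: all equal to 1 (up to rounding)
print(probs.sum(dim=0))   # column sums: generally different from 1
```

The same row-sum check can be applied to `attention_scores` as a quick test of Exercise 1.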