# Coding Attention Mechanisms

This section implements attention mechanisms in code, one step at a time.

**Attention** is a mechanism that lets a model decide which parts of the input are most relevant when generating each part of the output.

**Self-Attention** is a specific type of attention that allows each position in the input sequence to consider the relevancy of, or “attend to,” all other positions in the same sequence when computing the representation. Self-attention is a key component of contemporary LLMs based on the Transformer architecture, such as the GPT series.

## The “Self” in Self-Attention

In self-attention, the “self” refers to the mechanism’s ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image.

This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models. For example, attention might be computed between an input sequence and an output sequence in translation models.
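The shape of the score matrix makes this difference concrete. The following is a minimal sketch using plain dot products on random tensors; the names `source` and `target` and the sequence lengths are illustrative assumptions, not part of the original notebook.

```python
import torch

torch.manual_seed(0)

# Hypothetical embeddings: a 6-token source sequence and a 4-token target sequence
source = torch.rand(6, 3)
target = torch.rand(4, 3)

# Self-attention: every position is compared with every position of the SAME sequence
self_scores = source @ source.T    # one row and one column per source token

# Attention across two sequences (as in sequence-to-sequence models):
# every target position is compared with every source position
cross_scores = target @ source.T   # one row per target token, one column per source token

print(self_scores.shape)   # torch.Size([6, 6])
print(cross_scores.shape)  # torch.Size([4, 6])
```

In the self-attention case the score matrix is square because queries and keys come from the same sequence.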

## How Attention Works

In a Transformer (the architecture behind LLMs), attention works using three main vectors for each token (word or subword):

- **Query (Q)** – What am I looking for?

- **Key (K)** – What do I contain?

- **Value (V)** – What information do I carry?

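The model computes how much each word should “attend” to the others by comparing the query of the current word with the keys of every word in the sequence, including its own. The resulting scores are normalized into weights, and the output for the current word is the weighted sum of the values. The first two cells below use a simplified version without trainable weights, in which the input embeddings themselves play all three roles.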
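As a rough sketch of how the three roles fit together, the snippet below projects token embeddings into queries, keys, and values and computes the output for one token. The projection matrices `W_q`, `W_k`, `W_v`, the random inputs, and the sizes are assumptions made for illustration; in a real model the projections are learned.

```python
import torch

torch.manual_seed(0)

d_in, d_out = 3, 2               # assumed embedding and projection sizes
x = torch.rand(6, d_in)          # 6 hypothetical token embeddings

# Hypothetical projection matrices (trainable parameters in a real model)
W_q = torch.rand(d_in, d_out)
W_k = torch.rand(d_in, d_out)
W_v = torch.rand(d_in, d_out)

queries = x @ W_q                # "What am I looking for?"
keys = x @ W_k                   # "What do I contain?"
values = x @ W_v                 # "What information do I carry?"

# Compare the query of the second token with the keys of all tokens
scores_2 = queries[1] @ keys.T                 # one score per token
weights_2 = torch.softmax(scores_2, dim=-1)    # normalize so the weights sum to 1
context_2 = weights_2 @ values                 # weighted sum of the values

print(scores_2.shape, context_2.shape)
```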


In [1]:
import torch

# -----------------------------
# Input sequence embeddings
# Each row is the embedding vector of one token in the sequence (e.g., a word)
# -----------------------------
inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # "Your"     (x^1)
        [0.55, 0.87, 0.66],  # "journey"  (x^2)
        [0.57, 0.85, 0.64],  # "starts"   (x^3)
        [0.22, 0.58, 0.33],  # "with"     (x^4)
        [0.77, 0.25, 0.10],  # "one"      (x^5)
        [0.05, 0.80, 0.55],  # "step"     (x^6)
    ]
)

# -----------------------------
# Select the query vector
# Here, we are computing attention for the second word "journey" (x^2)
# -----------------------------
query = inputs[1]

# -----------------------------
# Compute raw attention scores
# Dot product between the query and each input token
# Higher score = more relevant token for this query
# -----------------------------
attn_scores_2 = torch.empty(inputs.shape[0])  # preallocate tensor for scores
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

print("Raw attention scores:", attn_scores_2)
print(50 * "-")

# -----------------------------
# Normalize attention scores (simple sum normalization)
# Sum of weights = 1 (not ideal, usually softmax is better)
# -----------------------------
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights (normalized by sum):", attn_weights_2_tmp)
print("Sum of weights:", attn_weights_2_tmp.sum())
print(50 * "-")


# -----------------------------
# Define naive softmax function
# Converts raw scores into probabilities (weights) between 0 and 1
# -----------------------------
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)


attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights (naive softmax):", attn_weights_2_naive)
print("Sum of weights:", attn_weights_2_naive.sum())
print(50 * "-")

# -----------------------------
# Use PyTorch's built-in softmax (numerically stable)
# -----------------------------
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights (PyTorch softmax):", attn_weights_2)
print("Sum of weights:", attn_weights_2.sum())
print(50 * "-")

# -----------------------------
# Compute context vector
# Weighted sum of all input embeddings based on attention weights
# This is the output of the self-attention mechanism for the query
# -----------------------------
context_vec_2 = torch.zeros(query.shape)  # initialize context vector
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i

print("Context vector (output of attention):", context_vec_2)
print(50 * "-")


Raw attention scores: tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
--------------------------------------------------
Attention weights (normalized by sum): tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum of weights: tensor(1.0000)
--------------------------------------------------
Attention weights (naive softmax): tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum of weights: tensor(1.)
--------------------------------------------------
Attention weights (PyTorch softmax): tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum of weights: tensor(1.)
--------------------------------------------------
Context vector (output of attention): tensor([0.4419, 0.6515, 0.5683])
--------------------------------------------------
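The naive and built-in softmax agree on the small scores above, so the stability remark in the code comments is easy to miss. The sketch below shows where the naive version breaks down. The helper `softmax_stable` and the test values are illustrative assumptions; subtracting the maximum is a common stabilization trick, shown here as an illustration rather than as a description of PyTorch's internals.

```python
import torch


def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)


def softmax_stable(x):
    # Subtracting the maximum does not change the result mathematically,
    # but it keeps every exponent <= 0, so exp() cannot overflow
    shifted = x - x.max()
    return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)


large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

naive = softmax_naive(large_scores)      # exp(1000.) overflows to inf in float32, so inf / inf = nan
stable = softmax_stable(large_scores)
builtin = torch.softmax(large_scores, dim=0)

print("naive  :", naive)
print("stable :", stable)
print("builtin:", builtin)
```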


In [2]:
# Computing attention weights for all tokens
import torch

# -----------------------------
# Input sequence embeddings
# Each row is the embedding vector of one token in the sequence (e.g., a word)
# -----------------------------
inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # "Your"     (x^1)
        [0.55, 0.87, 0.66],  # "journey"  (x^2)
        [0.57, 0.85, 0.64],  # "starts"   (x^3)
        [0.22, 0.58, 0.33],  # "with"     (x^4)
        [0.77, 0.25, 0.10],  # "one"      (x^5)
        [0.05, 0.80, 0.55],  # "step"     (x^6)
    ]
)

# -----------------------------
# Compute raw attention scores manually
# attn_scores[i, j] = dot product between token i and token j
# This measures how much token i should attend to token j
# -----------------------------
attn_scores = torch.empty(inputs.shape[0], inputs.shape[0])  # preallocate (seq_len x seq_len) matrix
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)  # dot product = similarity

print("Raw attention scores (manual):")
print(attn_scores)
print(50 * "-")

# -----------------------------
# Compute attention scores using matrix multiplication (faster)
# inputs @ inputs.T computes all dot products at once
# -----------------------------
attn_scores = inputs @ inputs.T
print("Raw attention scores (matrix multiplication):")
print(attn_scores)
print(50 * "-")

# -----------------------------
# Apply softmax along each row
# Converts raw scores into attention weights (probabilities) for each token
# Each row sums to 1
# -----------------------------
attn_weights = torch.softmax(attn_scores, dim=-1)
print("Attention weights (softmax):")
print(attn_weights)
print(50 * "-")

# -----------------------------
# Compute context vectors for all tokens
# Multiply attention weights with input embeddings
# This gives the new representation of each token after attending to all tokens
# -----------------------------
all_context_vecs = attn_weights @ inputs
print("Context vectors (output of self-attention):")
print(all_context_vecs)
print(50 * "-")


Raw attention scores (manual):
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
--------------------------------------------------
Raw attention scores (matrix multiplication):
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
--------------------------------------------------
Attention weights (softmax):
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1

### The Rationale Behind Scaled Dot-Product Attention

The attention scores are divided by the square root of the embedding dimension of the keys to improve training by avoiding very small gradients. As the embedding dimension grows (it is typically greater than 1,000 for GPT-like LLMs), the dot products tend to grow in magnitude as well. Applying softmax to large values makes it behave more like a step function: one weight approaches 1, the others approach 0, and the gradients during backpropagation approach zero. Such small gradients can drastically slow down learning or cause training to stagnate.

This scaling by the square root of the embedding dimension is why this self-attention mechanism is also called scaled dot-product attention.
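The effect can be checked numerically. The sketch below assumes that unscaled dot products grow roughly with the square root of the dimension `d_k`, which holds for vectors with independent, zero-mean, unit-variance components. It compares the softmax output and the gradient size with and without scaling. The base scores, `d_k = 1024`, and the helper `softmax_and_grad_norm` are all illustrative assumptions.

```python
import torch


def softmax_and_grad_norm(scores):
    # Return the softmax weights and the norm of the gradient of the
    # largest weight with respect to the input scores
    scores = scores.clone().requires_grad_(True)
    weights = torch.softmax(scores, dim=-1)
    weights.max().backward()
    return weights.detach(), scores.grad.norm().item()


d_k = 1024                                          # assumed key dimension
base = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])    # hypothetical small scores

unscaled = base * d_k**0.5       # stand-in for dot products in a large dimension
scaled = unscaled / d_k**0.5     # scaled dot-product scores (equal to base)

w_unscaled, g_unscaled = softmax_and_grad_norm(unscaled)
w_scaled, g_scaled = softmax_and_grad_norm(scaled)

print("unscaled weights:", w_unscaled, "grad norm:", g_unscaled)
print("scaled weights:  ", w_scaled, "grad norm:", g_scaled)
```

Without scaling, nearly all of the weight lands on a single token and the gradient norm is far smaller than in the scaled case.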

**See the implementation of `TinySelfAttentionQKV`.**
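The `tiny_gpt` module itself is not reproduced in this section. As a rough guide, a self-attention module with trainable query, key, and value matrices initialized via `nn.Parameter(torch.rand(d_in, d_out))` might look like the sketch below. The class name `SketchSelfAttention` and its attribute names are hypothetical, and the actual `TinySelfAttentionQKV` may differ in its details.

```python
import torch
import torch.nn as nn


class SketchSelfAttention(nn.Module):
    """Hypothetical stand-in; not the actual TinySelfAttentionQKV."""

    def __init__(self, d_in, d_out):
        super().__init__()
        # Trainable projection matrices of shape (d_in, d_out)
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        queries = x @ self.W_query
        keys = x @ self.W_key
        values = x @ self.W_value
        attn_scores = queries @ keys.T
        # Scale by the square root of the key dimension before softmax
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values


torch.manual_seed(0)
x = torch.rand(6, 3)                        # 6 hypothetical tokens, 3 features each
sketch = SketchSelfAttention(d_in=3, d_out=2)
out = sketch(x)
print(out.shape)  # torch.Size([6, 2])
```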

Note that `nn.Linear` in `SelfAttention_v2` uses a different weight initialization scheme than the `nn.Parameter(torch.rand(d_in, d_out))` used in `SelfAttention_v1`, so the two mechanisms produce different results. To check that the two implementations are otherwise similar, we can transfer the weight matrices from a `SelfAttention_v2` object to a `SelfAttention_v1` object so that both produce the same results.

Your task is to correctly assign the weights from an instance of `SelfAttention_v2` to an instance of `SelfAttention_v1`. To do this, you need to understand the relationship between the weights in the two versions. (Hint: `nn.Linear` stores the weight matrix in transposed form.) After the assignment, both instances should produce the same outputs.


In [8]:
# Inspect how nn.Linear stores its weight matrix
import torch.nn as nn

d_in = 3
d_out = 2


query = nn.Linear(d_in, d_out, bias=False)
print(query)
print(query.weight)
print(50*"-")

Linear(in_features=3, out_features=2, bias=False)
Parameter containing:
tensor([[-0.1833,  0.2312,  0.4019],
        [ 0.1258,  0.0657,  0.1622]], requires_grad=True)
--------------------------------------------------
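The printed weight has shape `(d_out, d_in)`, which is the transposed layout mentioned in the hint. The short check below uses only `nn.Linear`, so it illustrates the layout without solving the exercise. The input `x` and the variable `W` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out = 3, 2

linear = nn.Linear(d_in, d_out, bias=False)
x = torch.rand(6, d_in)                  # 6 hypothetical token embeddings

print(linear.weight.shape)               # torch.Size([2, 3]), i.e. (d_out, d_in)

# nn.Linear computes x @ weight.T (plus a bias, disabled here)
manual = x @ linear.weight.T
print(torch.allclose(linear(x), manual))

# A (d_in, d_out) matrix W used as x @ W therefore corresponds to weight.T
W = nn.Parameter(linear.weight.T.detach().clone())
print(torch.allclose(linear(x), x @ W))
```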


In [3]:
# Computing context vectors with trainable query, key, and value projections
import torch
from tiny_gpt import TinySelfAttentionQKV, TinySelfAttentionQKVLinear  # Import our tiny self-attention module

# -----------------------------
# Input sequence embeddings
# Each row is the embedding vector of one token in the sequence (e.g., a word)
# For example: "Your journey starts with one step"
# -----------------------------
inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # "Your"     (x^1)
        [0.55, 0.87, 0.66],  # "journey"  (x^2)
        [0.57, 0.85, 0.64],  # "starts"   (x^3)
        [0.22, 0.58, 0.33],  # "with"     (x^4)
        [0.77, 0.25, 0.10],  # "one"      (x^5)
        [0.05, 0.80, 0.55],  # "step"     (x^6)
    ]
)

# -----------------------------
# Prepare variables for TinySelfAttentionQKV
# d_in: input dimension (size of token embedding)
# d_out: output dimension of attention space (can be different from d_in)
# Here we use d_out=2 for simplicity
# -----------------------------
d_in = inputs.shape[1]  # 3, because each token has 3 features
d_out = 2               # project into 2-dimensional attention space

# -----------------------------
# Pick a single token to inspect (optional)
# x_2 = inputs[1]  # "journey"
# -----------------------------

# -----------------------------
# Set manual seed for reproducibility
# -----------------------------
torch.manual_seed(123)

# -----------------------------
# Instantiate the TinySelfAttentionQKV module
# This will create learnable matrices for query, key, and value projections
# -----------------------------
sa_v1 = TinySelfAttentionQKV(d_in, d_out)

# -----------------------------
# Forward pass: compute attention output for all tokens
# Each token attends to all other tokens and produces a new representation
# The output shape is (seq_len, d_out) = (6, 2)
# -----------------------------
context_vectors = sa_v1(inputs)

# Print the context vectors (output of self-attention)
print("Context vectors after self-attention:")
print(context_vectors)




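# -----------------------------
# Repeat with the second variant (TinySelfAttentionQKVLinear)
# It is seeded separately, so its weights and outputs differ from sa_v1
# -----------------------------
torch.manual_seed(789)
sa_v2 = TinySelfAttentionQKVLinear(d_in, d_out)
print(sa_v2(inputs))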


Context vectors after self-attention:
tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)
tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)



## Hiding Future Words with Causal Attention