# Coding Attention Mechanisms

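This notebook builds attention mechanisms step by step: simplified self-attention, self-attention with trainable weights, causal attention, and multi-head attention.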

**Attention** is a mechanism that lets a model decide which parts of the input are most relevant when generating each part of the output.

**Self-Attention** is a specific type of attention that allows each position in the input sequence to consider the relevancy of, or “attend to,” all other positions in the same sequence when computing the representation. Self-attention is a key component of contemporary LLMs based on the Transformer architecture, such as the GPT series.

## The “Self” in Self-Attention

In self-attention, the “self” refers to the mechanism’s ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image.

This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models. For example, attention might be computed between an input sequence and an output sequence in translation models.

## How Attention Works

In a Transformer (the architecture behind LLMs), attention works using three main vectors for each token (word or subword):

- **Query (Q)** – What am I looking for?

- **Key (K)** – What do I contain?

- **Value (V)** – What information do I carry?

The model computes how much each word should “attend” to the others by comparing the query of the current word with the keys of every word in the sequence, including its own.
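
The query/key/value comparison described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the projection matrices `W_q`, `W_k`, and `W_v` are random stand-ins for trained weights, the token embeddings are random, and the scaling factor discussed later is left out.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 3, 2
x = torch.rand(4, d_in)            # 4 hypothetical token embeddings

# Random stand-ins for trained projection matrices (assumption for illustration)
W_q = torch.rand(d_in, d_out)
W_k = torch.rand(d_in, d_out)
W_v = torch.rand(d_in, d_out)

queries = x @ W_q                  # "What am I looking for?"
keys = x @ W_k                     # "What do I contain?"
values = x @ W_v                   # "What information do I carry?"

# Compare the query of token 1 with the keys of every token, itself included
scores_1 = keys @ queries[1]
weights_1 = torch.softmax(scores_1, dim=0)

# The context vector is the weighted sum of all value vectors
context_1 = weights_1 @ values
print("Attention weights for token 1:", weights_1)
print("Context vector for token 1:", context_1)
```

The cell that follows simplifies this further by dropping the projections and using the raw embeddings as queries, keys, and values at once.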


In [None]:
import torch

# -----------------------------
# Input sequence embeddings
# Each row represents a token in the sequence (e.g., words) and their embedding vectors
# -----------------------------
inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # "Your"     (x^1)
        [0.55, 0.87, 0.66],  # "journey"  (x^2)
        [0.57, 0.85, 0.64],  # "starts"   (x^3)
        [0.22, 0.58, 0.33],  # "with"     (x^4)
        [0.77, 0.25, 0.10],  # "one"      (x^5)
        [0.05, 0.80, 0.55],  # "step"     (x^6)
    ]
)

# -----------------------------
# Select the query vector
# Here, we are computing attention for the second word "journey" (x^2)
# -----------------------------
query = inputs[1]

# -----------------------------
# Compute raw attention scores
# Dot product between the query and each input token
# Higher score = more relevant token for this query
# -----------------------------
attn_scores_2 = torch.empty(inputs.shape[0])  # preallocate tensor for scores
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

print("Raw attention scores:", attn_scores_2)
print(50 * "-")

# -----------------------------
# Normalize attention scores (simple sum normalization)
# Sum of weights = 1 (not ideal, usually softmax is better)
# -----------------------------
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights (normalized by sum):", attn_weights_2_tmp)
print("Sum of weights:", attn_weights_2_tmp.sum())
print(50 * "-")


# -----------------------------
# Define naive softmax function
# Converts raw scores into probabilities (weights) between 0 and 1
# -----------------------------
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)


attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights (naive softmax):", attn_weights_2_naive)
print("Sum of weights:", attn_weights_2_naive.sum())
print(50 * "-")

# -----------------------------
# Use PyTorch's built-in softmax (numerically stable)
# -----------------------------
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights (PyTorch softmax):", attn_weights_2)
print("Sum of weights:", attn_weights_2.sum())
print(50 * "-")

# -----------------------------
# Compute context vector
# Weighted sum of all input embeddings based on attention weights
# This is the output of the self-attention mechanism for the query
# -----------------------------
context_vec_2 = torch.zeros(query.shape)  # initialize context vector
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i

print("Context vector (output of attention):", context_vec_2)
print(50 * "-")


In [None]:
# Computing attention weights for all tokens
import torch

# -----------------------------
# Input sequence embeddings
# Each row represents a token in the sequence (e.g., words) and their embedding vectors
# -----------------------------
inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # "Your"     (x^1)
        [0.55, 0.87, 0.66],  # "journey"  (x^2)
        [0.57, 0.85, 0.64],  # "starts"   (x^3)
        [0.22, 0.58, 0.33],  # "with"     (x^4)
        [0.77, 0.25, 0.10],  # "one"      (x^5)
        [0.05, 0.80, 0.55],  # "step"     (x^6)
    ]
)

# -----------------------------
# Compute raw attention scores manually
# attn_scores[i, j] = dot product between token i and token j
# This measures how much token i should attend to token j
# -----------------------------
attn_scores = torch.empty(6, 6)  # initialize 6x6 matrix
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)  # dot product = similarity

print("Raw attention scores (manual):")
print(attn_scores)
print(50 * "-")

# -----------------------------
# Compute attention scores using matrix multiplication (faster)
# inputs @ inputs.T computes all dot products at once
# -----------------------------
attn_scores = inputs @ inputs.T
print("Raw attention scores (matrix multiplication):")
print(attn_scores)
print(50 * "-")

# -----------------------------
# Apply softmax along each row
# Converts raw scores into attention weights (probabilities) for each token
# Each row sums to 1
# -----------------------------
attn_weights = torch.softmax(attn_scores, dim=-1)
print("Attention weights (softmax):")
print(attn_weights)
print(50 * "-")

# -----------------------------
# Compute context vectors for all tokens
# Multiply attention weights with input embeddings
# This gives the new representation of each token after attending to all tokens
# -----------------------------
all_context_vecs = attn_weights @ inputs
print("Context vectors (output of self-attention):")
print(all_context_vecs)
print(50 * "-")


### The Rationale Behind Scaled Dot-Product Attention

In self-attention with trainable weights, the attention scores are divided by the square root of the key embedding dimension before softmax is applied (the `keys.shape[-1]**0.5` term in the later code cells). This scaling improves training by avoiding very small gradients. As the embedding dimension grows (it is typically greater than 1,000 for GPT-like LLMs), the dot products grow in magnitude as well. Softmax applied to large inputs behaves more like a step function, so its gradients approach zero during backpropagation. Such small gradients can drastically slow down learning or cause training to stagnate.

The scaling by the square root of the embedding dimension is the reason why this self-attention mechanism is also called scaled dot-product attention.
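
A quick way to see the step-function behavior is to compare softmax with and without scaling on dot products of high-dimensional vectors. The sketch below uses random vectors and an assumed key dimension of 1,024 purely for illustration; for unit-variance random vectors, the dot products have a standard deviation of roughly the square root of the dimension, which is what the scaling compensates for.

```python
import torch

torch.manual_seed(0)
d_k = 1024                      # assumed key dimension, chosen for illustration
query = torch.randn(d_k)
keys = torch.randn(8, d_k)      # 8 hypothetical tokens

scores = keys @ query           # raw dot products; their spread grows with d_k

unscaled = torch.softmax(scores, dim=0)
scaled = torch.softmax(scores / d_k**0.5, dim=0)

print("Largest weight without scaling:", unscaled.max().item())
print("Largest weight with scaling:   ", scaled.max().item())
```

Without scaling, nearly all of the weight tends to land on a single token, which is the saturated regime where softmax gradients become very small.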

**See the implementation of `TinySelfAttentionQKV` in the `tiny_gpt` module.**

Note that `nn.Linear` in `SelfAttention_v2` uses a different weight initialization scheme than the `nn.Parameter(torch.rand(d_in, d_out))` used in `SelfAttention_v1`, so the two mechanisms produce different results. To check that both implementations are otherwise equivalent, we can transfer the weight matrices from a `SelfAttention_v2` object to a `SelfAttention_v1` object so that both then produce the same results. The cells below instantiate the corresponding classes as `sa_v1` (`TinySelfAttentionQKV`) and `sa_v2` (`TinySelfAttentionQKVLinear`).

Your task is to correctly assign the weights from an instance of `SelfAttention_v2` to an instance of `SelfAttention_v1`. To do this, you need to understand the relationship between the weights in both versions. (Hint: `nn.Linear` stores the weight matrix in transposed form.) After the assignment, both instances should produce the same outputs.
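
One possible way to approach the exercise is sketched below. The two classes are hypothetical stand-ins written for this example, not the `tiny_gpt` implementations; in particular, the attribute names `w_query`, `w_key`, and `w_value` on the parameter-based class and `w_value` on the linear class are assumptions. The point is only that `nn.Linear(d_in, d_out)` stores its weight with shape `(d_out, d_in)`, so it must be transposed before being copied.

```python
import torch
import torch.nn as nn


class SketchAttentionV1(nn.Module):
    """Hypothetical stand-in: projections stored as raw (d_in, d_out) parameters."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_query = nn.Parameter(torch.rand(d_in, d_out))
        self.w_key = nn.Parameter(torch.rand(d_in, d_out))
        self.w_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        queries, keys, values = x @ self.w_query, x @ self.w_key, x @ self.w_value
        weights = torch.softmax(queries @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        return weights @ values


class SketchAttentionV2(nn.Module):
    """Hypothetical stand-in: projections stored as bias-free nn.Linear layers."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_query = nn.Linear(d_in, d_out, bias=False)
        self.w_key = nn.Linear(d_in, d_out, bias=False)
        self.w_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        queries, keys, values = self.w_query(x), self.w_key(x), self.w_value(x)
        weights = torch.softmax(queries @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        return weights @ values


torch.manual_seed(0)
x = torch.rand(6, 3)
v1, v2 = SketchAttentionV1(3, 2), SketchAttentionV2(3, 2)

# Copy each nn.Linear weight, shape (d_out, d_in), into the matching
# parameter, shape (d_in, d_out), by transposing it first
with torch.no_grad():
    for name in ("w_query", "w_key", "w_value"):
        linear_weight = getattr(v2, name).weight
        getattr(v1, name).copy_(linear_weight.T)

print("Outputs match:", torch.allclose(v1(x), v2(x)))
```

The same transposed copy should apply to the real classes once their attribute names are known.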


In [None]:
# Inspect how nn.Linear stores its weights: the shape is (d_out, d_in), i.e. transposed
import torch.nn as nn

d_in = 3
d_out = 2


query = nn.Linear(d_in, d_out, bias=False)
print(query)
print(query.weight)
print(50*"-")

In [None]:
# Self-attention with trainable query, key, and value projections
import torch
from tiny_gpt import TinySelfAttentionQKV, TinySelfAttentionQKVLinear  # Import our tiny self-attention module

# -----------------------------
# Input sequence embeddings
# Each row represents a token in the sequence (e.g., words) and their embedding vectors
# For example: "Your journey starts with one step"
# -----------------------------
inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # "Your"     (x^1)
        [0.55, 0.87, 0.66],  # "journey"  (x^2)
        [0.57, 0.85, 0.64],  # "starts"   (x^3)
        [0.22, 0.58, 0.33],  # "with"     (x^4)
        [0.77, 0.25, 0.10],  # "one"      (x^5)
        [0.05, 0.80, 0.55],  # "step"     (x^6)
    ]
)

# -----------------------------
# Prepare variables for TinySelfAttentionQKV
# d_in: input dimension (size of token embedding)
# d_out: output dimension of attention space (can be different from d_in)
# Here we use d_out=2 for simplicity
# -----------------------------
d_in = inputs.shape[1]  # 3, because each token has 3 features
d_out = 2               # project into 2-dimensional attention space

# -----------------------------
# Pick a single token to inspect (optional)
# x_2 = inputs[1]  # "journey"
# -----------------------------

# -----------------------------
# Set manual seed for reproducibility
# -----------------------------
torch.manual_seed(123)

# -----------------------------
# Instantiate the TinySelfAttentionQKV module
# This will create learnable matrices for query, key, and value projections
# -----------------------------
sa_v1 = TinySelfAttentionQKV(d_in, d_out)

# -----------------------------
# Forward pass: compute attention output for all tokens
# Each token attends to all other tokens and produces a new representation
# The output shape is (seq_len, d_out) = (6, 2)
# -----------------------------
context_vectors = sa_v1(inputs)

# Print the context vectors (output of self-attention)
print("Context vectors after self-attention:")
print(context_vectors)




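# Same mechanism with nn.Linear projections
# (different weight initialization, so the outputs differ from sa_v1)
torch.manual_seed(789)
sa_v2 = TinySelfAttentionQKVLinear(d_in, d_out)
print(sa_v2(inputs))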


## Hiding Future Words with Causal Attention

Causal attention, also known as masked attention, is a specialized form of self-attention. When computing attention scores for a given token, it restricts the model to the current and previous inputs in the sequence. This is in contrast to the standard self-attention mechanism, which allows access to the entire input sequence at once.


In [None]:
import torch
from tiny_gpt import TinySelfAttentionQKV, TinySelfAttentionQKVLinear  

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # "Your"     (x^1)
        [0.55, 0.87, 0.66],  # "journey"  (x^2)
        [0.57, 0.85, 0.64],  # "starts"   (x^3)
        [0.22, 0.58, 0.33],  # "with"     (x^4)
        [0.77, 0.25, 0.10],  # "one"      (x^5)
        [0.05, 0.80, 0.55],  # "step"     (x^6)
    ]
)

d_in = inputs.shape[1]  
d_out = 2              

torch.manual_seed(789)
sa_v2 = TinySelfAttentionQKVLinear(d_in, d_out)
print(sa_v2(inputs))

queries = sa_v2.w_query(inputs)
keys = sa_v2.w_key(inputs)
attention_scores = queries @ keys.T
attn_weights = torch.softmax(attention_scores / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)


**Attention weights** determine how much each input element contributes to the current output.

Each row of the attention weight matrix belongs to one token and holds one weight per token in the sequence, indicating how strongly that token attends to each of them.

To hide future words, the attention weights for future positions must be zero.





In [None]:
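context_length = attention_scores.shape[0]

# Option 1: mask the attention weights after softmax, then renormalize
# torch.tril keeps the lower triangle (current and previous positions)
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(f"Lower-triangular mask:\n{mask_simple}")
print(50*"-")

# Zero out the weights above the diagonal (future positions)
masked_simple = attn_weights*mask_simple
print(masked_simple)
print(50*"-")

# Renormalize so that each row sums to 1 again
row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)
print(50*"-")

# Option 2: mask the raw scores with -inf before softmax
# torch.triu with diagonal=1 marks the positions above the diagonal
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
print(mask)
print(50*"-")
masked = attention_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)
print(50*"-")

# softmax maps -inf to 0, so future positions get zero weight and each row sums to 1
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)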
print(attn_weights)
print(50*"-")
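
The two masking approaches in the cell above should give the same weights: zeroing future positions after softmax and renormalizing, or filling future scores with `-inf` before softmax. The following self-contained check uses random scores and an assumed key dimension of 2 rather than the notebook's tensors.

```python
import torch

torch.manual_seed(0)
n, d_k = 4, 2                        # assumed sequence length and key dimension
scores = torch.randn(n, n) / d_k**0.5

# Approach 1: softmax first, zero out future positions, renormalize each row
weights = torch.softmax(scores, dim=-1)
lower = torch.tril(torch.ones(n, n))
masked = weights * lower
renormalized = masked / masked.sum(dim=-1, keepdim=True)

# Approach 2: fill future positions with -inf, then apply softmax once
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
direct = torch.softmax(scores.masked_fill(future, -torch.inf), dim=-1)

print("Approaches agree:", torch.allclose(renormalized, direct))
```

The second approach needs only one normalization step, which is why it is the more common choice.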

### Masking additional attention weights with dropout

Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively “dropping” them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It’s important to emphasize that dropout is only used during training and is disabled afterward.

In [None]:
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)    # dropout rate of 50%
example = torch.ones(6, 6)         # matrix of ones
print(dropout(example))            # surviving entries are scaled by 1/(1 - 0.5) = 2
print(50*"-")


torch.manual_seed(123)
print(dropout(attn_weights))
print(50*"-")
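
To illustrate that dropout is active only during training, the sketch below runs the same `torch.nn.Dropout` layer in training and evaluation mode. The 4x4 matrix of ones is an arbitrary example input.

```python
import torch

torch.manual_seed(0)
dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(4, 4)

# Training mode: about half of the entries are zeroed,
# and the survivors are scaled by 1 / (1 - p) = 2
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is disabled and the input passes through unchanged
dropout.eval()
eval_out = dropout(x)

print("Training mode:\n", train_out)
print("Evaluation mode:\n", eval_out)
```

The rescaling in training mode keeps the expected magnitude of the weights the same whether or not dropout is applied.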

In [None]:
import torch
from tiny_gpt import TinyCausalAttention

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # "Your"     (x^1)
        [0.55, 0.87, 0.66],  # "journey"  (x^2)
        [0.57, 0.85, 0.64],  # "starts"   (x^3)
        [0.22, 0.58, 0.33],  # "with"     (x^4)
        [0.77, 0.25, 0.10],  # "one"      (x^5)
        [0.05, 0.80, 0.55],  # "step"     (x^6)
    ]
)



# In practice, batches come from a data loader (inputs and targets).
# For this test, we simulate a batch of two by stacking the same input twice.

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)  

d_in = inputs.shape[1]  
d_out = 2        

torch.manual_seed(123)
context_length = batch.shape[1]
ca = TinyCausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)

### Multi-Head Attention
The term “multi-head” refers to dividing the attention mechanism into multiple “heads,” each operating independently. In this context, a single causal attention module can be considered single-head attention, where there is only one set of attention weights processing the input sequentially.

We will tackle this expansion from causal attention to multi-head attention. First, we will intuitively build a multi-head attention module by stacking multiple CausalAttention modules.
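
The stacking idea can be sketched as a thin wrapper that runs several causal attention heads and concatenates their outputs. Both classes below are hypothetical stand-ins written for this example, not the `tiny_gpt` classes. Note that with this approach the output width is `d_out * num_heads`, whereas a more efficient implementation may instead split a single `d_out` across the heads.

```python
import torch
import torch.nn as nn


class SketchCausalAttention(nn.Module):
    """Hypothetical single-head causal attention for batched inputs."""

    def __init__(self, d_in, d_out, context_length, dropout):
        super().__init__()
        self.w_query = nn.Linear(d_in, d_out, bias=False)
        self.w_key = nn.Linear(d_in, d_out, bias=False)
        self.w_value = nn.Linear(d_in, d_out, bias=False)
        self.dropout = nn.Dropout(dropout)
        # True above the diagonal marks future positions
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        self.register_buffer("mask", mask.bool())

    def forward(self, x):
        _, num_tokens, _ = x.shape
        queries, keys, values = self.w_query(x), self.w_key(x), self.w_value(x)
        scores = queries @ keys.transpose(1, 2)
        scores = scores.masked_fill(self.mask[:num_tokens, :num_tokens], -torch.inf)
        weights = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)
        return self.dropout(weights) @ values


class SketchMultiHeadWrapper(nn.Module):
    """Hypothetical wrapper: independent heads, outputs concatenated."""

    def __init__(self, d_in, d_out, context_length, dropout, num_heads):
        super().__init__()
        self.heads = nn.ModuleList(
            [SketchCausalAttention(d_in, d_out, context_length, dropout)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # Each head returns (batch, tokens, d_out); concatenate along the last dim
        return torch.cat([head(x) for head in self.heads], dim=-1)


torch.manual_seed(123)
batch = torch.rand(2, 6, 3)          # (batch, tokens, d_in), random example data
mha = SketchMultiHeadWrapper(d_in=3, d_out=2, context_length=6, dropout=0.0, num_heads=2)
out = mha(batch)
print(out.shape)                     # torch.Size([2, 6, 4])
```

Because the heads run one after another in a Python loop, this version is easy to read but slower than a batched implementation.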

In [None]:
a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573],    # shape: (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4)
                    [0.8993, 0.0390, 0.9268, 0.7388],
                    [0.7179, 0.7058, 0.9156, 0.4340]],

                   [[0.0772, 0.3565, 0.1479, 0.5331],
                    [0.4066, 0.2318, 0.4545, 0.9737],
                    [0.4606, 0.5159, 0.4220, 0.5786]]]])

print(a.shape)
print(50*"-")


b = a.transpose(2,3)
print(b)
print(50*"-")

print(b.shape)
print(50*"-")


first_head = a[0, 0, :, :]
print(first_head)
print(first_head.shape)
print(50*"-")
first_res = first_head @ first_head.T
print("First head:\n", first_res)

second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)
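
The per-head products computed above can also be obtained in a single batched matrix multiplication, since `@` operates on the last two dimensions and treats the leading ones as batch dimensions. The check below uses a random tensor of the same shape as `a`, chosen for illustration.

```python
import torch

torch.manual_seed(0)
a = torch.rand(1, 2, 3, 4)           # (batch, heads, tokens, head_dim)

# One batched matmul over the last two dimensions: (1, 2, 3, 4) @ (1, 2, 4, 3)
batched = a @ a.transpose(2, 3)

# The same result, computed one head at a time
per_head = torch.stack([a[0, h] @ a[0, h].T for h in range(a.shape[1])])

print(batched.shape)
print("Batched result matches per-head results:", torch.allclose(batched[0], per_head))
```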

In [2]:
import torch
from tiny_gpt import TinyMultiHeadAttention
torch.manual_seed(123)

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # "Your"     (x^1)
        [0.55, 0.87, 0.66],  # "journey"  (x^2)
        [0.57, 0.85, 0.64],  # "starts"   (x^3)
        [0.22, 0.58, 0.33],  # "with"     (x^4)
        [0.77, 0.25, 0.10],  # "one"      (x^5)
        [0.05, 0.80, 0.55],  # "step"     (x^6)
    ]
)



# In practice, batches come from a data loader (inputs and targets).
# For this test, we simulate a batch of two by stacking the same input twice.

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)  
print(50*"-")



batch_size, context_length, d_in = batch.shape
d_out = 2
mha = TinyMultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print(50*"-")
print("context_vecs.shape:", context_vecs.shape)
print(50*"-")

torch.Size([2, 6, 3])
--------------------------------------------------
tensor([[[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]],

        [[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
--------------------------------------------------
context_vecs.shape: torch.Size([2, 6, 2])
--------------------------------------------------


For comparison, the smallest GPT-2 model (117 million parameters) has 12 attention heads and a context vector embedding size of 768. The largest GPT-2 model (1.5 billion parameters) has 25 attention heads and a context vector embedding size of 1,600. The embedding sizes of the token inputs and context embeddings are the same in GPT models (d_in = d_out).

### Summary

- Attention mechanisms transform input elements into enhanced context vector representations that incorporate information about all inputs.
- A self-attention mechanism computes the context vector representation as a weighted sum over the inputs.
- In a simplified attention mechanism, the attention weights are computed via dot products.
- A dot product is a concise way of multiplying two vectors element-wise and then summing the products.
- Matrix multiplications, while not strictly required, help us implement computations more efficiently and compactly by replacing nested for loops.
- In self-attention mechanisms used in LLMs, also called scaled dot-product attention, we include trainable weight matrices to compute intermediate transformations of the inputs: queries, keys, and values.
- When working with LLMs that read and generate text from left to right, we add a causal attention mask to prevent the LLM from accessing future tokens.
- In addition to causal attention masks that zero out attention weights, we can add a dropout mask to reduce overfitting in LLMs.
- The attention modules in transformer-based LLMs involve multiple instances of causal attention, which is called multi-head attention.
- We can create a multi-head attention module by stacking multiple instances of causal attention modules.
- A more efficient way of creating multi-head attention modules involves batched matrix multiplications.
