# Demonstrating Masked Self-Attention in Transformers
## This notebook gives an intuitive, practical demonstration of masked self-attention in Transformers using PyTorch.

## Masked Self Attention

$$
\text{self attention} = \text{softmax}\bigg(\frac{QK^T}{\sqrt{d_k}}+M\bigg)
$$

$$
\text{new V} = \text{self attention} \cdot V
$$

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
import random
random.seed(24)  # Python random seed
torch.manual_seed(24)  # PyTorch seed (CPU)

In [None]:
# Set print options: no scientific notation, 4 decimal places
torch.set_printoptions(sci_mode=False, precision=4)

# Define the maximum sequence length and the embedding dimension for a model:

max_sequence_length = 5: Specifies the maximum number of tokens a sequence can have. If a sequence is shorter, it may be padded; if longer, it may be truncated.

d_model = 8: Defines the size of each token's embedding vector, meaning each token will be represented as an 8-dimensional vector.

In [None]:
d_model = 8
max_sequence_length = 5

# Define three linear layers using nn.Linear in PyTorch:

w_query: Projects input embeddings into query space.

w_key: Projects input embeddings into key space.

w_value: Projects input embeddings into value space.
## These linear layers transform input embeddings (d_model dimensional) into new representations of the same size (d_model → d_model)

In [None]:
w_query = nn.Linear(d_model, d_model)
w_key   = nn.Linear(d_model, d_model)
w_value = nn.Linear(d_model, d_model)
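To make the projection step concrete, here is a minimal sketch of what a single `nn.Linear(d_model, d_model)` layer computes. The names `layer`, `x`, `manual`, and `builtin` are illustrative and not part of the notebook; the sketch assumes the default layer with a bias term.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8
layer = nn.Linear(d_model, d_model)   # built the same way as w_query / w_key / w_value
x = torch.randn(5, d_model)           # 5 toy token embeddings (illustrative input)

# nn.Linear stores its weight with shape (out_features, in_features)
# and computes x @ W^T + b for every row of x.
manual = x @ layer.weight.T + layer.bias
builtin = layer(x)

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(manual, builtin, atol=1e-6))
```

Because the input and output sizes are both `d_model`, the projected queries, keys, and values keep the shape of the input.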

# Create a tensor tokens with random values, scaled by a factor of 10.0, to simulate a sequence of token embeddings.

Use torch.randn() to generate a random tensor of shape (max_sequence_length, d_model), where:

max_sequence_length represents the number of tokens in the sequence.

d_model represents the embedding dimension.

Multiply the generated tensor by 10.0 to scale the values.

In [None]:
tokens = torch.randn(max_sequence_length, d_model) * 10.0

In [None]:
tokens.shape

In [None]:
tokens

# Apply linear transformations to the tokens tensor using w_query, w_key, and w_value to obtain query (q), key (k), and value (v) representations.

## Pass tokens through the three linear layers to compute q, k, and v.

In [None]:
q = w_query(tokens)
k = w_key(tokens)
v = w_value(tokens)

In [None]:
q.shape, k.shape, v.shape

## Masked Self Attention

$$
\text{self attention} = \text{softmax}\bigg(\frac{QK^T}{\sqrt{d_k}}+M\bigg)
$$

$$
\text{new V} = \text{self attention} \cdot V
$$

In [None]:
attn_scores = torch.matmul(q, k.T) / torch.sqrt(torch.tensor(d_model, dtype=torch.float))

In [None]:
attn_scores

In [None]:
attn_scores.shape
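The division by the square root of d_k in the formula keeps the scores in a range where softmax does not saturate. The sketch below illustrates this under the assumption that query and key entries are independent with unit variance. The names `q_demo` and `k_demo` and the choice `d_k = 64` are illustrative only; in this notebook d_k equals `d_model` because there is a single attention head.

```python
import torch

torch.manual_seed(0)

d_k = 64        # illustrative key dimension
n = 10_000      # number of random query/key pairs

q_demo = torch.randn(n, d_k)
k_demo = torch.randn(n, d_k)

raw = (q_demo * k_demo).sum(dim=-1)   # n independent dot products
scaled = raw / d_k ** 0.5             # same scaling as in the attention formula

# The variance of the raw dot products grows with d_k,
# while the scaled version stays close to 1.
print(raw.var().item(), scaled.var().item())
```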

## Masking

- Masking ensures that a token cannot draw context from tokens that come after it in the sequence.
- It is not required in encoder self-attention, but it is required in decoder self-attention.
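One common way to write the mask $M$ from the attention formula, assuming row $i$ indexes the query position and column $j$ the key position:

$$
M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}
$$

Adding $-\infty$ to a score drives its softmax weight to zero, so position $i$ places no weight on any later position $j$.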

# Create a lower triangular mask by applying torch.tril to a square matrix of ones, which keeps the lower triangle (including the diagonal) as ones and sets the upper triangle to zeros. This mask is used in masked self-attention in transformers so that each position in a sequence can attend only to itself and earlier positions, preventing access to future tokens during decoding.
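A minimal sketch of this step, assuming a sequence length of 5 to match `max_sequence_length`. The names `seq_len` and `causal_mask` are illustrative.

```python
import torch

seq_len = 5  # matches max_sequence_length in this notebook

# torch.tril keeps the lower triangle (diagonal included) of its input
# and zeros everything above the diagonal.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(causal_mask)
```

Row i of the mask has ones in columns 0 through i, so the first token sees only itself and the last token sees the whole sequence.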

# Apply the mask to the attention scores using masked_fill, setting positions where mask == 0 to -inf. After the softmax these positions receive a weight of exactly zero, so future tokens are ignored in masked self-attention and the model cannot attend to unseen tokens during autoregressive decoding.
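The masking step can be sketched on a small stand-in score matrix. Here `toy_scores` is random data used in place of `attn_scores`, and the other names are illustrative as well.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len = 5
toy_scores = torch.randn(seq_len, seq_len)              # stand-in for attn_scores
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = visible, 0 = future

# Positions where the mask is 0 (future tokens) are replaced with -inf.
masked_scores = toy_scores.masked_fill(causal_mask == 0, float("-inf"))

# softmax maps -inf to a weight of 0, so each row is normalized
# over the visible positions only.
toy_weights = F.softmax(masked_scores, dim=-1)

print(masked_scores)
print(toy_weights)
print(toy_weights.sum(dim=-1))
```

Every row still sums to 1, and the first row puts all of its weight on the first token because that is the only position it can see.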

In [None]:
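mask = torch.tril(torch.ones(max_sequence_length, max_sequence_length))  # causal mask: 1 = visible, 0 = future
attn_weights = F.softmax(attn_scores.masked_fill(mask == 0, float("-inf")), dim=-1)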

In [None]:
attn_weights

In [None]:
print(f"sum = {attn_weights.sum(dim=-1)}")

# Compute the weighted sum of value (v) vectors using the attention weights, so that each query token receives a context-aware representation. In masked self-attention, each token's output mixes only its own value vector and those of earlier tokens.

In [None]:
attention_output = torch.matmul(attn_weights, v)

In [None]:
attention_output.shape

In [None]:
attention_output
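As a sanity check on the whole pipeline, the sketch below wraps the steps in one function and tests the causal property: changing the last token should leave the outputs for all earlier positions untouched. The function `masked_self_attention` and the names `w_q`, `w_k`, `w_v`, `x`, and `x_changed` are hypothetical helpers written for this illustration, not part of the notebook or of PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


def masked_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over x of shape (seq_len, d_model)."""
    q, k, v = w_q(x), w_k(x), w_v(x)
    scores = q @ k.T / k.shape[-1] ** 0.5
    mask = torch.tril(torch.ones(x.shape[0], x.shape[0]))
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


d_model, seq_len = 8, 5
w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))
x = torch.randn(seq_len, d_model)

with torch.no_grad():
    out_original = masked_self_attention(x, w_q, w_k, w_v)

    # Change only the last token and recompute.
    x_changed = x.clone()
    x_changed[-1] = torch.randn(d_model)
    out_changed = masked_self_attention(x_changed, w_q, w_k, w_v)

# Earlier positions never attend to the last token, so their outputs should match.
print(torch.allclose(out_original[:-1], out_changed[:-1], atol=1e-6))
# The last position does see the changed token, so its output should differ.
print(torch.allclose(out_original[-1], out_changed[-1], atol=1e-6))
```

Without the mask, the same experiment would change every row of the output, since every token would attend to the modified last token.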