# Demonstrating Scaled Dot-Product Attention & Self-Attention in Transformers
## This notebook will provide an intuitive and practical demonstration of Scaled Dot-Product Attention and Self-Attention in Transformers using PyTorch.

![Description](Scaled_Dot_Product_Attention.png)

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
import random
random.seed(42)  # Python random seed
torch.manual_seed(42)  # PyTorch seed (CPU)

<torch._C.Generator at 0x7f691466f970>

In [3]:
# Set print options: no scientific notation, 4 decimal places
torch.set_printoptions(sci_mode=False, precision=4)

# Define the maximum sequence length and the embedding dimension for a model:

max_sequence_length = 10: Specifies the maximum number of tokens a sequence can have. If a sequence is shorter, it may be padded; if longer, it may be truncated.

d_model = 8: Defines the size of each token’s embedding vector, meaning each token will be represented as an 8-dimensional vector.

In [4]:
d_model = 8
max_sequence_length = 10
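The padding and truncation mentioned above are not performed in this notebook, because the token embeddings are generated with exactly max_sequence_length rows. As a minimal sketch of the idea, assuming token ids stored in a plain Python list and an assumed padding id of 0 (both the helper name `pad_or_truncate` and `PAD_ID` are hypothetical):

```python
# Hypothetical helper: force a list of token ids to exactly max_len entries.
PAD_ID = 0  # assumed id reserved for padding; real tokenizers define their own

def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Return a copy of token_ids that is exactly max_len long."""
    clipped = token_ids[:max_len]                          # drop anything past max_len
    return clipped + [pad_id] * (max_len - len(clipped))   # fill the remainder

short = pad_or_truncate([5, 8, 2], max_len=10)
long = pad_or_truncate(list(range(1, 15)), max_len=10)
print(short)  # [5, 8, 2, 0, 0, 0, 0, 0, 0, 0]
print(long)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

In a real model the padded positions would usually also be masked out of the attention scores; that step is outside the scope of this notebook.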

# Define three linear layers using nn.Linear in PyTorch:

w_query: Projects input embeddings into query space.

w_key: Projects input embeddings into key space.

w_value: Projects input embeddings into value space.
## These linear layers transform input embeddings (d_model dimensional) into new representations of the same size (d_model → d_model)

In [5]:
w_query = nn.Linear(d_model, d_model)
w_key   = nn.Linear(d_model, d_model)
w_value = nn.Linear(d_model, d_model)

# w_query has a learnable weight matrix of shape (d_model, d_model) and a bias vector of shape (d_model).

## The .weight attribute stores the trainable weight matrix of this transformation; the bias is stored separately in .bias.

### Display the randomly initialized values of w_query.weight.

In [6]:
w_query.weight # , w_key.weight, w_value.weight

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076],
        [ 0.3117, -0.2594,  0.3073,  0.0662,  0.2612,  0.0479,  0.1705, -0.0499],
        [ 0.2725,  0.0523, -0.1651,  0.0901, -0.1629, -0.0415, -0.1436,  0.2345],
        [-0.2791, -0.1630, -0.0998, -0.2126,  0.0334, -0.3492,  0.3193, -0.3003],
        [ 0.2730,  0.0588, -0.1148,  0.2185,  0.0551,  0.2857,  0.0387, -0.1115],
        [ 0.0950, -0.0959,  0.1488,  0.3157,  0.2044, -0.1546,  0.2041,  0.0633],
        [ 0.1795, -0.2155, -0.3500, -0.1366, -0.2712,  0.2901,  0.1018,  0.1464],
        [ 0.1118, -0.0062,  0.2767, -0.2512,  0.0223, -0.2413,  0.1090, -0.1218]],
       requires_grad=True)
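To make the role of the weight matrix concrete, the sketch below checks that an nn.Linear layer computes `x @ weight.T + bias`. It uses a fresh layer and a random input (the names `layer`, `x`, and `manual` are illustrative, not part of the notebook), so the numbers differ from the ones printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)           # arbitrary seed, only for repeatability
layer = nn.Linear(8, 8)        # same shape as w_query in this notebook
x = torch.randn(10, 8)         # stand-in for a (sequence, d_model) input

# nn.Linear stores weight as (out_features, in_features), hence the transpose.
manual = x @ layer.weight.T + layer.bias
print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(layer(x), manual, atol=1e-5))
```

The transpose is the reason the printed weight has one row per output feature rather than one row per input feature.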

# Create a tensor tokens with random values, scaled by a factor of 10.0, to simulate a sequence of token embeddings.

Use torch.randn() to draw a tensor of standard-normal samples with shape (max_sequence_length, d_model), where:

max_sequence_length represents the number of tokens in the sequence.

d_model represents the embedding dimension.

Multiply the generated tensor by 10.0 to scale the values.

In [7]:
tokens = torch.randn(max_sequence_length, d_model) * 10.0

# Print the shape of the tensor tokens to confirm its dimensions.

In [8]:
tokens.shape

torch.Size([10, 8])

# Print the tokens tensor to inspect its values

In [9]:
tokens

tensor([[    13.0321,      4.8787,     11.3399,     -3.5556,      3.6183,
             19.9935,      6.6301,      7.0473],
        [     0.2127,     -8.2927,    -10.8086,     -7.8385,      5.0710,
              0.8208,      4.4398,     -7.2403],
        [    -4.6113,     -0.6388,    -13.6673,      3.2982,     -9.8271,
              3.0177,      1.7869,     -1.2931],
        [   -15.7541,     22.5084,     10.0123,     13.6424,      6.3332,
              4.0500,      3.4159,     -2.2136],
        [     1.7290,     10.5136,      0.0749,     -0.7737,      6.4269,
              5.7425,      5.8672,     -0.1885],
        [    -9.1432,     14.8397,     -9.1091,     -5.2910,     -8.0515,
              5.1580,     -7.1288,      2.1962],
        [     5.6351,     18.5822,     10.4407,     -8.6382,      8.3509,
             -3.1571,      2.6911,      0.8540],
        [   -14.1288,    -18.7906,     -1.7983,      7.9039,      5.2394,
             -2.6935,    -16.1906,      0.0126],
        [     8.

# Apply linear transformations to the tokens tensor using w_query, w_key, and w_value to obtain query (q), key (k), and value (v) representations.

## Pass tokens through the three linear layers to compute q, k, and v.

In [10]:
q = w_query(tokens)
k = w_key(tokens)
v = w_value(tokens)

# Print the shapes of q, k, and v to verify their dimensions.

In [11]:
q.shape, k.shape, v.shape

(torch.Size([10, 8]), torch.Size([10, 8]), torch.Size([10, 8]))

## Self Attention

$$
\text{self attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
$$

$$
\text{new V} = \text{self attention} \cdot V
$$
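The two formulas can be read as a single function. The following is a minimal sketch for one attention head without masking, assuming 2-D inputs of shape (sequence length, d_k); the function name `simple_attention` is hypothetical, and the random inputs are stand-ins rather than the notebook's own q, k, and v.

```python
import math
import torch
import torch.nn.functional as F

def simple_attention(q, k, v):
    """Single-head attention without masking for 2-D inputs (seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.T / math.sqrt(d_k)      # (seq_len, seq_len) scaled scores
    weights = F.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ v, weights            # weighted sum of value vectors

torch.manual_seed(0)                       # arbitrary seed for repeatability
q, k, v = (torch.randn(10, 8) for _ in range(3))
out, weights = simple_attention(q, k, v)
print(out.shape, weights.shape)
```

The cells that follow carry out the same steps one at a time so that each intermediate tensor can be inspected.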

# Compute the attention score matrix using torch.matmul(q, k.T) without scaling

In [12]:
torch.matmul(q, k.T)

tensor([[-108.3274,   57.7419,   18.5742,  -58.1449,  -47.4800,  -68.6371,
          -77.5921,   66.9448,  -33.8048,   -2.1428],
        [ -34.7808,   12.9728,   -8.8561,   -5.0074,  -12.5370,    3.0406,
           28.6565,  -22.4715,  -53.5321, -107.4938],
        [ -36.0648,   17.3945,   62.1235,  -34.9770,    0.7812,  106.7213,
            4.0961,  -25.0537,   13.3692,    5.9479],
        [  55.7433,  -23.3669,   33.5451,  -25.7747,   25.1168,  -13.1104,
          -33.6642,  -43.2133,   85.4181,  161.3757],
        [ -13.6088,   20.2927,   24.7524,  -56.7094,  -12.7326,  -14.9126,
          -35.0543,  -10.1397,   15.2518,   15.7781],
        [ 104.5473,  -97.4240,   57.7952,  133.3300,   67.0595,  168.0302,
           56.6170,  -78.5524,   29.1397,  182.7953],
        [ 121.6384,  -55.9330,  -30.9015,   32.2752,   19.1241,  -50.3688,
           -1.5578,  -33.8538,   20.8590,   58.9215],
        [ -19.7858,  -32.7079,  -34.7863,  140.5346,   35.8255,   17.4498,
           59.9764,   

# Observe that the variance of the dot product is significantly larger than the variance of q and k

In [13]:
# Why we need sqrt(d_k) in denominator
q.var(), k.var(), torch.matmul(q, k.T).var()

(tensor(21.7614, grad_fn=<VarBackward0>),
 tensor(31.7007, grad_fn=<VarBackward0>),
 tensor(3491.7310, grad_fn=<VarBackward0>))
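One way to see where the growth comes from: if the components of q and k are independent with zero mean and unit variance, a dot product over d_k components has variance d_k, and dividing by sqrt(d_k) brings it back to 1. The sketch below checks this empirically with the standard library only. It is an idealized setting; the notebook's q and k are not unit-variance, so the numbers here will not match the output above.

```python
import random
import statistics

random.seed(0)          # arbitrary seed for repeatability
d_k = 8                 # matches d_model in this notebook
n_samples = 20_000

dots = []
for _ in range(n_samples):
    q_vec = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    k_vec = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    dots.append(sum(qi * ki for qi, ki in zip(q_vec, k_vec)))

raw_var = statistics.pvariance(dots)                               # close to d_k
scaled_var = statistics.pvariance([d / d_k ** 0.5 for d in dots])  # close to 1
print(f"raw: {raw_var:.2f}  scaled: {scaled_var:.2f}")
```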

# Compute the attention scores by taking the dot product of query (q) and key (k.T) using torch.matmul(q, k.T).

# Scale the scores by dividing them by the square root of d_k to keep their variance in check and prevent extremely large values. With a single attention head, as here, d_k equals d_model.

## Store the result in attn_scores, which will be used for the softmax operation in the attention mechanism.

In [14]:
attn_scores = torch.matmul(q, k.T) / torch.sqrt(torch.tensor(d_model, dtype=torch.float))

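# Observe how scaling affects variance: compare q.var(), k.var(), and attn_scores.var() after applying the scaling. Dividing the scores by sqrt(d_model) divides their variance by d_model = 8, from 3491.7310 to 436.4664.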

In [15]:
# After dividing by sqrt(d_k), the score variance is d_k times smaller
q.var(), k.var(), attn_scores.var()

(tensor(21.7614, grad_fn=<VarBackward0>),
 tensor(31.7007, grad_fn=<VarBackward0>),
 tensor(436.4664, grad_fn=<VarBackward0>))

# Print the attention scores 

In [16]:
attn_scores

tensor([[-38.2995,  20.4149,   6.5670, -20.5573, -16.7867, -24.2669, -27.4329,
          23.6686, -11.9518,  -0.7576],
        [-12.2969,   4.5866,  -3.1311,  -1.7704,  -4.4325,   1.0750,  10.1316,
          -7.9449, -18.9265, -38.0048],
        [-12.7508,   6.1499,  21.9640, -12.3662,   0.2762,  37.7317,   1.4482,
          -8.8578,   4.7267,   2.1029],
        [ 19.7082,  -8.2615,  11.8600,  -9.1127,   8.8801,  -4.6352, -11.9021,
         -15.2782,  30.1999,  57.0549],
        [ -4.8114,   7.1746,   8.7513, -20.0498,  -4.5017,  -5.2724, -12.3936,
          -3.5849,   5.3923,   5.5784],
        [ 36.9631, -34.4446,  20.4337,  47.1393,  23.7091,  59.4077,  20.0171,
         -27.7725,  10.3025,  64.6279],
        [ 43.0057, -19.7753, -10.9253,  11.4110,   6.7614, -17.8080,  -0.5508,
         -11.9691,   7.3748,  20.8319],
        [ -6.9953, -11.5640, -12.2988,  49.6865,  12.6662,   6.1695,  21.2048,
          11.3453,  -7.2815,  16.6078],
        [-42.4258,  25.5635,  13.0669, -23.1995,

# Print the shape of the attention scores to understand their dimensions
## The shape of attn_scores should be (max_sequence_length, max_sequence_length): row i holds the scores of token i against every token in the sequence, including itself.

In [17]:
attn_scores.shape

torch.Size([10, 10])

# Apply the softmax function to the attention scores to obtain attention weights
## This converts the raw attention scores into probabilities, ensuring that the sum of attention weights across each row equals 1.

## Higher values indicate stronger attention to specific tokens.

In [18]:
attn_weights = F.softmax(attn_scores, dim=-1)

# Print the attention weights 

In [19]:
attn_weights

tensor([[    0.0000,     0.0372,     0.0000,     0.0000,     0.0000,     0.0000,
             0.0000,     0.9628,     0.0000,     0.0000],
        [    0.0000,     0.0039,     0.0000,     0.0000,     0.0000,     0.0001,
             0.9960,     0.0000,     0.0000,     0.0000],
        [    0.0000,     0.0000,     0.0000,     0.0000,     0.0000,     1.0000,
             0.0000,     0.0000,     0.0000,     0.0000],
        [    0.0000,     0.0000,     0.0000,     0.0000,     0.0000,     0.0000,
             0.0000,     0.0000,     0.0000,     1.0000],
        [    0.0000,     0.1610,     0.7792,     0.0000,     0.0000,     0.0000,
             0.0000,     0.0000,     0.0271,     0.0326],
        [    0.0000,     0.0000,     0.0000,     0.0000,     0.0000,     0.0054,
             0.0000,     0.0000,     0.0000,     0.9946],
        [    1.0000,     0.0000,     0.0000,     0.0000,     0.0000,     0.0000,
             0.0000,     0.0000,     0.0000,     0.0000],
        [    0.0000,     0.
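The weights in this run are close to one-hot because the scaled scores are still large in magnitude, and softmax turns a gap of a few units into a large ratio. The sketch below recomputes the first row of attn_weights from the first row of attn_scores printed earlier, then repeats the computation with the scores divided by 10. The division by 10 is an arbitrary illustration, not something the notebook does, and the `softmax` helper is a plain-Python stand-in for F.softmax.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)                                # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# First row of attn_scores as printed in the notebook.
row = [-38.2995, 20.4149, 6.5670, -20.5573, -16.7867,
       -24.2669, -27.4329, 23.6686, -11.9518, -0.7576]

sharp = softmax(row)                           # reproduces the nearly one-hot row
soft = softmax([x / 10.0 for x in row])        # same scores, ten times smaller
print([round(w, 4) for w in sharp])
print([round(w, 4) for w in soft])
```

With the original scores, almost all of the weight falls on tokens 7 and 1 (0.9628 and 0.0372). With the smaller scores, the weight is spread over several tokens.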

# Compute the weighted sum of value (v) vectors using the attention weights
Each query now receives a context-aware representation by combining value vectors (v), where the contribution of each value is determined by the attention weights. This helps in focusing on the most relevant parts of the input.

In [20]:
attention_output = torch.matmul(attn_weights, v)
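A toy example may help show what this matrix product does. Assuming three made-up value vectors and two made-up rows of attention weights (none of these numbers come from the notebook), a one-hot weight row copies a single value vector, while a mixed row averages several:

```python
import torch

v_toy = torch.tensor([[1.0, 0.0],
                      [0.0, 1.0],
                      [2.0, 2.0]])             # three toy value vectors
w_toy = torch.tensor([[0.0, 0.0, 1.0],         # one-hot: copies row 2 of v_toy
                      [0.5, 0.5, 0.0]])        # even mix of rows 0 and 1

out_toy = torch.matmul(w_toy, v_toy)           # shape (2, 2)
print(out_toy)
```

The same effect is visible in this notebook: rows 3 and 5 of attn_weights (counting from 0) both place nearly all of their weight on token 9, so the corresponding rows of attention_output are nearly identical.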

# Print the shape of the attention output to understand its dimensions
The shape of attention_output should match the shape of v, (max_sequence_length, d_model), confirming that each token has been transformed into a new representation based on attention weights.

In [21]:
attention_output.shape

torch.Size([10, 8])

# Print the attention output to observe how the values have been transformed after applying attention weights
This output represents the refined token representations after the attention mechanism has weighted and aggregated the value vectors (v) based on the computed attention scores.

In [22]:
attention_output

tensor([[     2.1261,      1.8769,     -4.6177,      7.3562,     -3.0561,
             -6.4733,    -10.4790,     11.3530],
        [     4.0124,      1.9254,      7.8426,     -3.4303,      6.4231,
              5.0072,      7.5677,     -2.2979],
        [    -6.6414,     -7.1332,      0.0274,     -2.6242,      2.7893,
              1.1445,      2.5857,     -2.2104],
        [     0.3210,     -5.0049,      4.0713,     -3.4348,     -2.5455,
             -2.8200,      1.0738,     -2.9991],
        [    -4.3984,     -4.0713,     -3.9356,      0.6972,     -3.2912,
             -2.2215,     -1.1523,     -2.2966],
        [     0.2835,     -5.0163,      4.0495,     -3.4305,     -2.5169,
             -2.7986,      1.0819,     -2.9949],
        [    -4.4170,      5.4806,      6.9589,     -5.0158,      4.6338,
              4.0796,      5.1147,     -9.5842],
        [     9.7750,     -3.0269,     -7.3841,     -7.0403,      6.6433,
              1.1184,      1.8107,      6.3116],
        [    -2.