# Daily Challenge: Simplified Self-Attention Explained

Task
1. Simplified self-attention

We implement a simplified variant of self-attention, free of any trainable weights. The goal of this section is to illustrate a few key concepts in self-attention before adding trainable weights.

 - Load Input Tensor (Word Embeddings):
    - Start with numerical representations of words (embeddings) because neural networks process numbers. This is the input data our self-attention mechanism will work on.


In [15]:
import torch

inputs = torch.tensor(
[
    [0.43, 0.15, 0.89], # your
    [0.55, 0.87, 0.66], # journey
    [0.57, 0.85, 0.64], # starts
    [0.22, 0.58, 0.33], # with
    [0.77, 0.25, 0.10], # one
    [0.05, 0.80, 0.55] # step
]
)


- Select a Query Vector:
  - In self-attention, we compare each word (vector) against others to understand their relationships. The “query” is the word we’re currently focusing on.
- 1.1 Computing Attention Weights for Inputs[2]:
  - 1.1.1 Attention Score:
      The dot product serves as a similarity measure: the more closely two vectors align (and the larger their magnitudes), the higher the score. Here it tells us how relevant each word is to our “query” word.


In [16]:
query = inputs[2]
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
  print(x_i)
  attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)

tensor([0.4300, 0.1500, 0.8900])
tensor([0.5500, 0.8700, 0.6600])
tensor([0.5700, 0.8500, 0.6400])
tensor([0.2200, 0.5800, 0.3300])
tensor([0.7700, 0.2500, 0.1000])
tensor([0.0500, 0.8000, 0.5500])
tensor([0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605])
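
As a sanity check on the first entry of the output above, the dot product can be spelled out in plain Python. This is only a minimal sketch; the helper name `dot` is introduced here for illustration and is not part of the original notebook.

```python
# Hand-computed dot product, no PyTorch needed.
# `dot` is a hypothetical helper used only for this illustration.
def dot(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

x_0 = [0.43, 0.15, 0.89]    # "your"   (inputs[0])
query = [0.57, 0.85, 0.64]  # "starts" (inputs[2])

# 0.43*0.57 + 0.15*0.85 + 0.89*0.64 = 0.2451 + 0.1275 + 0.5696
score = dot(x_0, query)
print(round(score, 4))  # 0.9422
```

The result agrees with the first element of `attn_scores_2`.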


- 1.1.2 Attention Weights:

  - Softmax transforms the scores into probabilities (attention weights). These weights represent how much “attention” each word should receive when we create the context vector.



In [17]:
import torch.nn.functional as F
attn_weights_2 = F.softmax(attn_scores_2, dim = 0)
print(attn_weights_2)

tensor([0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565])
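
To make the softmax step less of a black box, here is a rough plain-Python version applied to the scores printed earlier. `naive_softmax` is a name made up for this sketch; library implementations typically also subtract the maximum score before exponentiating for numerical stability, which is not needed for values this small.

```python
import math

# `naive_softmax` is a hypothetical helper for illustration only.
def naive_softmax(scores):
    """Return exp(s_i) / sum_j exp(s_j) for every score s_i."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores for the query inputs[2], copied (rounded) from the output above.
attn_scores_2 = [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605]
attn_weights_2 = naive_softmax(attn_scores_2)

print([round(w, 4) for w in attn_weights_2])
print(sum(attn_weights_2))  # approximately 1.0, up to floating-point error
```

All weights are positive and sum to 1, which is what lets them be read as a distribution of attention over the six words.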


- 1.1.3 Context Vector:

    The context vector is a weighted sum of the input vectors. It represents a refined version of the query, incorporating information from other relevant words.



In [20]:
context_vector_2 = torch.sum(attn_weights_2.unsqueeze(1) * inputs, dim=0)
context_vector_2

tensor([0.4431, 0.6496, 0.5671])
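
The `unsqueeze`-and-broadcast expression above is a compact way of writing a weighted sum. A loop-based sketch in plain Python, using the rounded weights from the earlier output (so the result only matches to about three decimals), might look like this:

```python
inputs = [
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]
# Rounded attention weights for the query inputs[2], copied from the output above.
attn_weights_2 = [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565]

# Accumulate weight * vector, one input word at a time.
context_vector_2 = [0.0, 0.0, 0.0]
for w, x in zip(attn_weights_2, inputs):
    for d in range(len(x)):
        context_vector_2[d] += w * x[d]

print([round(c, 4) for c in context_vector_2])
```

Each input vector contributes in proportion to its attention weight, so "journey" and "starts" dominate the result.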

- 1.2 Computing Attention Weights for All Inputs:

    - 1.2.1 Attention Score:
        - Extend the process to compute attention scores for every word against every other word in the sequence. This creates a matrix of relationships.



In [21]:
attn_scores = inputs @ inputs.T

In [22]:
attn_scores

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
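
One property worth noticing in the matrix above: because queries and keys are the same untransformed vectors at this stage, the score matrix is symmetric. The following plain-Python sketch (with a hypothetical `dot` helper) rebuilds the matrix with explicit loops and checks that property, along with the fact that row 2 reproduces the per-query scores from section 1.1.

```python
inputs = [
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]

# `dot` is a hypothetical helper used only for this illustration.
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

n = len(inputs)
# Entry [i][j] is the score of word i (as query) against word j.
attn_scores = [[dot(inputs[i], inputs[j]) for j in range(n)] for i in range(n)]

# Symmetry check: score(i, j) equals score(j, i).
is_symmetric = all(
    abs(attn_scores[i][j] - attn_scores[j][i]) < 1e-12
    for i in range(n)
    for j in range(n)
)
print(is_symmetric)
print([round(s, 4) for s in attn_scores[2]])  # scores for the query inputs[2]
```

This symmetry generally disappears once separate query and key projections are introduced.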

- 1.2.2 Attention Weights:

  - Apply softmax along each row (`dim=1`) so that every row sums to 1. Row i then holds the attention weights of word i over all words in the sequence.



In [23]:
attn_weights = F.softmax(attn_scores, dim=1)
attn_weights

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])
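
The choice of `dim` is easy to get wrong, so a quick check can help. The sketch below recomputes the weights and sums them along both axes; it assumes the same `inputs` tensor as earlier and is meant as a sanity check rather than part of the original notebook.

```python
import torch
import torch.nn.functional as F

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])

attn_scores = inputs @ inputs.T
attn_weights = F.softmax(attn_scores, dim=1)  # normalize within each row

row_sums = attn_weights.sum(dim=1)  # one sum per query word: all close to 1
col_sums = attn_weights.sum(dim=0)  # generally not 1
print(row_sums)
print(col_sums)
```

If the column sums came out as ones instead, the softmax would have been applied along the wrong dimension.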

- 1.2.3 All Context Vectors:
   - Generate a context vector for each word, capturing its meaning in the context of the entire sequence.



In [26]:
context_vectors = attn_weights @ inputs
context_vectors

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
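
Row 2 of the matrix above should be the same context vector that section 1.1 computed for `inputs[2]` on its own. A short sketch that runs both routes and compares them (assuming the same `inputs` tensor as before):

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])

# Batched route: all queries at once.
attn_weights = torch.softmax(inputs @ inputs.T, dim=1)
context_vectors = attn_weights @ inputs          # shape (6, 3)

# Single-query route for inputs[2], as in section 1.1.
weights_2 = torch.softmax(inputs @ inputs[2], dim=0)
context_vector_2 = weights_2 @ inputs            # shape (3,)

print(torch.allclose(context_vectors[2], context_vector_2, atol=1e-6))
```

The matrix form is simply the per-query computation repeated for every row.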

2. The ‘Self’ in Self-Attention

In self-attention, the ‘self’ refers to the mechanism’s ability to compute attention weights by relating different positions within a single input sequence.

- 2.1 Weight Parameters vs Attention Weights

In the weight matrices W, the term ‘weight’ is short for ‘weight parameters’, the values of a neural network that are optimized during training. This is not to be confused with the attention weights.

As we saw in the previous section, attention weights determine the extent to which a context vector depends on the different parts of the input, i.e., to what extent the network focuses on different parts of the input.

In summary, weight parameters are the fundamental, learnt coefficients that define the network’s connections, while attention weights are dynamic, context-specific values.

- Key distinction: learnt parameters (the weights of the network) are optimized during training, whereas attention weights are computed dynamically from each input. The two play different roles.
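
The difference between the two kinds of weights can be illustrated with a small experiment: hold the weight matrices fixed and feed in two different sequences. This is only a sketch; the seed, the random stand-in sequences, and the helper `attention_weights` are all assumptions made for the example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, chosen only for reproducibility

# Weight parameters: fixed here, as they would be after training.
W_q = torch.randn(3, 4)
W_k = torch.randn(3, 4)

# Hypothetical helper: attention weights depend on the input x.
def attention_weights(x):
    scores = (x @ W_q) @ (x @ W_k).T
    return torch.softmax(scores, dim=1)

seq_a = torch.rand(6, 3)  # stand-in embeddings for one sequence
seq_b = torch.rand(6, 3)  # stand-in embeddings for another sequence

w_a = attention_weights(seq_a)
w_b = attention_weights(seq_b)

# Same W_q and W_k, different inputs, hence different attention weights.
print(torch.allclose(w_a, w_b))
```

The parameters stay the same across both calls, while the attention weights are recomputed from scratch for each sequence.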



- 2.2 Computing Weight Parameters for inputs[1]:

    - 2.2.1 Initialize the three weight matrices Wq, Wk, Wv:
        - Introduce learnable weight matrices (Wq, Wk, Wv) to transform input vectors into queries, keys, and values. This adds flexibility and allows the model to learn complex relationships.



In [31]:
embed_dim = 3
hidden_dim = 4
Wq = torch.nn.Parameter(torch.randn(embed_dim, hidden_dim))
Wk = torch.nn.Parameter(torch.randn(embed_dim, hidden_dim))
Wv = torch.nn.Parameter(torch.randn(embed_dim, hidden_dim))
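
Since `torch.randn` draws new values on every run, the numbers in the outputs that follow will not reproduce exactly. One possible way to make them repeatable is to set a seed first; the seed value 123 below is an arbitrary assumption, not something taken from the original notebook.

```python
import torch

embed_dim = 3
hidden_dim = 4

torch.manual_seed(123)  # arbitrary seed; an assumption for this sketch
Wq = torch.nn.Parameter(torch.randn(embed_dim, hidden_dim))
Wk = torch.nn.Parameter(torch.randn(embed_dim, hidden_dim))
Wv = torch.nn.Parameter(torch.randn(embed_dim, hidden_dim))

# Re-seeding and drawing again yields the same first matrix.
torch.manual_seed(123)
Wq_again = torch.randn(embed_dim, hidden_dim)

print(Wq.shape)                     # torch.Size([3, 4])
print(Wq.requires_grad)             # True
print(torch.equal(Wq.data, Wq_again))
```

`torch.nn.Parameter` enables gradient tracking by default, which is why the outputs in the next cells carry a `grad_fn`.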

- 2.2.2 Compute the query, key, and value vectors for inputs[1]:

    - These transformations project the input into different “spaces” that emphasize different aspects of the word’s meaning.



In [28]:
x = inputs[1]
q = x @ Wq
k = inputs @ Wk
v = inputs @ Wv

- 2.2.3 Compute the Attention Score of inputs[1] against itself (ω11):

  - Calculate the similarity between the transformed query and key.



- 2.2.4 Compute all the Attention Scores for inputs[1]:

    - Calculate all the similarity scores against the query vector.

- 2.2.5 Attention weights for inputs[1]:

    - Normalize the attention scores.



In [32]:
scores = q @ k.T
attn_weights_1 = F.softmax(scores, dim=0)
attn_weights_1

tensor([0.2008, 0.1944, 0.1868, 0.1374, 0.0724, 0.2082],
       grad_fn=<SoftmaxBackward0>)
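
Step 2.2.3 has no cell of its own, so here is a hedged sketch of how the single score ω11 relates to the full score vector. The weight matrices are freshly seeded stand-ins (plain tensors rather than `Parameter`s, for brevity), so the values will not match the output above.

```python
import torch

torch.manual_seed(123)  # arbitrary seed; an assumption for this sketch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])
Wq = torch.randn(3, 4)  # stand-in for the notebook's Wq
Wk = torch.randn(3, 4)  # stand-in for the notebook's Wk

q = inputs[1] @ Wq  # query vector for inputs[1], shape (4,)
k = inputs @ Wk     # key vectors for all inputs, shape (6, 4)

omega_11 = torch.dot(q, k[1])  # score of inputs[1] against its own key
scores = q @ k.T               # all six scores for this query, shape (6,)

print(torch.isclose(omega_11, scores[1], atol=1e-5))
```

Computing `q @ k.T` therefore covers step 2.2.3 and step 2.2.4 in one operation.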

- 2.2.6 Calculate Context vector for inputs[1]:
   - Generate the context vector.



In [33]:
context_vector_1 = attn_weights_1 @ v
context_vector_1

tensor([ 0.7491, -0.7984, -0.1104, -0.2245], grad_fn=<SqueezeBackward4>)

- 2.3 Computing Weight Parameters for All Inputs:

    - 2.3.2 Compute the query, key, and value vectors:
        Compute the transformed vectors for all input words.
    - 2.3.3 Compute the Attention Score for all inputs:
        Compute all attention scores between all words.
    - 2.3.4 Attention weights for all inputs:
        Normalize the attention scores.
    - 2.3.5 Calculate Context vector for all inputs:
        Generate all context vectors.



In [35]:
Q = inputs @ Wq
K = inputs @ Wk
V = inputs @ Wv

In [39]:
scores_all = Q @ K.T
scores_all

tensor([[ 0.9402,  2.3164,  2.2994,  1.3546,  1.3448,  1.5989],
        [-0.2136,  1.7952,  1.8464,  1.0161,  2.2483,  0.5811],
        [-0.1998,  1.7890,  1.8424,  1.0038,  2.2846,  0.5480],
        [-0.3237,  0.7767,  0.8054,  0.4578,  1.0953,  0.2049],
        [ 0.1070,  1.1745,  1.2519,  0.5000,  2.2938, -0.2021],
        [-0.4544,  0.9136,  0.9231,  0.6198,  0.8321,  0.5544]],
       grad_fn=<MmBackward0>)

In [40]:
weights_all = F.softmax(scores_all, dim=1)
weights_all

tensor([[0.0725, 0.2870, 0.2822, 0.1097, 0.1086, 0.1400],
        [0.0297, 0.2214, 0.2331, 0.1016, 0.3484, 0.0658],
        [0.0299, 0.2184, 0.2304, 0.0996, 0.3585, 0.0631],
        [0.0662, 0.1989, 0.2047, 0.1446, 0.2735, 0.1123],
        [0.0550, 0.1600, 0.1729, 0.0815, 0.4901, 0.0404],
        [0.0550, 0.2160, 0.2181, 0.1610, 0.1991, 0.1508]],
       grad_fn=<SoftmaxBackward0>)

In [41]:
context_all = weights_all @ V
context_all

tensor([[-1.4007,  1.8906, -1.5530,  1.6846],
        [-1.1126,  1.7207, -1.2679,  1.7130],
        [-1.1010,  1.7136, -1.2554,  1.7134],
        [-1.1643,  1.7008, -1.3060,  1.6205],
        [-0.9365,  1.5894, -1.0687,  1.6790],
        [-1.2475,  1.7444, -1.4000,  1.6056]], grad_fn=<MmBackward0>)
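
The four steps of section 2.3 can be bundled into a small module. The class below is a sketch under stated assumptions: the name `SimpleSelfAttention` and the seed are made up for this example, and the scores are left unscaled to mirror the cells above (standard scaled dot-product attention would additionally divide them by the square root of the key dimension).

```python
import torch
import torch.nn as nn

class SimpleSelfAttention(nn.Module):  # hypothetical name for this sketch
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        # Trainable weight parameters, initialized randomly as in the notebook.
        self.Wq = nn.Parameter(torch.randn(embed_dim, hidden_dim))
        self.Wk = nn.Parameter(torch.randn(embed_dim, hidden_dim))
        self.Wv = nn.Parameter(torch.randn(embed_dim, hidden_dim))

    def forward(self, x):
        Q = x @ self.Wq                         # queries, (seq_len, hidden_dim)
        K = x @ self.Wk                         # keys,    (seq_len, hidden_dim)
        V = x @ self.Wv                         # values,  (seq_len, hidden_dim)
        scores = Q @ K.T                        # unscaled scores, (seq_len, seq_len)
        weights = torch.softmax(scores, dim=1)  # each row sums to 1
        return weights @ V                      # context vectors

torch.manual_seed(123)  # arbitrary seed; an assumption for this sketch
inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])

attn = SimpleSelfAttention(embed_dim=3, hidden_dim=4)
context_all = attn(inputs)
print(context_all.shape)  # torch.Size([6, 4])
```

Each of the six words ends up with a context vector of size `hidden_dim`, and the three matrices are registered as parameters that an optimizer could update.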