## What is an attention mechanism?

*   Words represented as static vectors don't consider context: "bat" (baseball) and "bat" (animal) get the same vector.
*   An attention mechanism changes the original vector, which may blend multiple meanings, by considering its context. Super interesting.
*   Even more, once the attention weights are applied, the vector for "bat" (the animal) can be shifted toward related concepts such as caves or blindness.



### Omega -> Context Vector

### Self-Attention Math Steps

Given a sequence of token embeddings \( x_1, x_2, \dots, x_n \), we compute:

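1. **Project each embedding into Query, Key, and Value vectors** using the weight matrices \( W^Q, W^K, W^V \). Assume these weights are learned during training, by checking how close the prediction was to the actual next word.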

$$
Q_i = x_i W^Q, \quad K_j = x_j W^K, \quad V_j = x_j W^V
$$

2. **Raw attention scores (ω)** using dot product between Query and Key:

$$
\omega_{ij} = Q_i \cdot K_j = Q_i K_j^\top
$$

3. **Normalize with softmax** to get attention weights:

$$
\alpha_{ij} = \text{softmax}_j(\omega_{ij}) = \frac{e^{\omega_{ij}}}{\sum_{k=1}^{n} e^{\omega_{ik}}}
$$

4. **Compute context vector \( z_i \)** (a weighted sum of Value vectors):

$$
z_i = \sum_{j=1}^{n} \alpha_{ij} V_j
$$
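
The four steps above can be sketched end to end in a few lines. This is only a minimal sketch: the matrices `W_q`, `W_k`, `W_v` are random stand-ins for trained weights, and `d_out = 2` and the seed are arbitrary choices made for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so the random weights are reproducible

# Toy embeddings: 6 tokens, 3 dimensions each.
x = torch.tensor(
  [[0.43, 0.15, 0.89],
   [0.55, 0.87, 0.66],
   [0.57, 0.85, 0.64],
   [0.22, 0.58, 0.33],
   [0.77, 0.25, 0.10],
   [0.05, 0.80, 0.55]]
)
d_in, d_out = x.shape[1], 2  # d_out is an assumption, not taken from the notes

# Random stand-ins for the trained matrices W^Q, W^K, W^V.
W_q = torch.rand(d_in, d_out)
W_k = torch.rand(d_in, d_out)
W_v = torch.rand(d_in, d_out)

Q, K, V = x @ W_q, x @ W_k, x @ W_v   # step 1: project every token
omega = Q @ K.T                        # step 2: omega[i, j] = Q_i . K_j
alpha = torch.softmax(omega, dim=-1)   # step 3: normalize each row over j
z = alpha @ V                          # step 4: z[i] = sum_j alpha[i, j] * V[j]

print(z.shape)  # torch.Size([6, 2])
```

Each row of `alpha` sums to one, and `z` holds one context vector per input token.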



In [None]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

### Simple Calculation of Attention Weights

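The dot product $\vec{a} \cdot \vec{b} = \|\vec{a}\| \|\vec{b}\| \cos(\theta)$ measures how aligned two vectors are: for fixed lengths, the more they point in the same direction, the larger the value. That makes it a usable similarity score between words represented as vectors in space. In this simplified version there are no trained weights, so the input embeddings themselves play the role of queries, keys, and values. Taking the dot product of the query vector (the token we are currently at, here $x_2$) with the vector of every token in the sequence, including itself, gives the raw scores: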

$$
\vec{\omega_{2}} =
\begin{bmatrix}
\varnothing \\
\varnothing \\
\varnothing \\
\varnothing \\
\varnothing \\
\varnothing
\end{bmatrix}
\rightarrow
\begin{bmatrix}
0.9544 \\
1.4950 \\
1.4754 \\
0.8434 \\
0.7070 \\
1.0865
\end{bmatrix}, \qquad \omega_{2j} = x_2 \cdot x_j
$$
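
As a quick sanity check of the dot-product identity, here is a small sketch that rebuilds the dot product from the two vector lengths and the cosine of the angle between them. The two vectors are the embeddings of "journey" and "starts"; the names `a`, `b`, and `rebuilt` are just for illustration.

```python
import torch

a = torch.tensor([0.55, 0.87, 0.66])  # journey (x^2)
b = torch.tensor([0.57, 0.85, 0.64])  # starts  (x^3)

dot = torch.dot(a, b)
cos_theta = torch.nn.functional.cosine_similarity(a, b, dim=0)

# ||a|| * ||b|| * cos(theta) should reproduce the dot product.
rebuilt = a.norm() * b.norm() * cos_theta

print(torch.allclose(dot, rebuilt))  # True
```

The cosine alone ignores vector length, while the dot product also grows with the norms, so the two are related but not interchangeable as scores.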




In [None]:
query = inputs[1]  # x^2 ("journey") is the token we compute attention for

# Allocate one score slot per token in the sentence.
print(inputs.shape[0])
attention_score_x2 = torch.empty(inputs.shape[0])
for index, x_i in enumerate(inputs):
  # Raw score: dot product of each token embedding with the query.
  attention_score_x2[index] = torch.dot(x_i, query)

print(attention_score_x2)

6
tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
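
The loop over tokens can probably be replaced by a single matrix-vector product, since each score is one row of the embedding matrix dotted with the query. A minimal sketch comparing the two (the names `scores_loop` and `scores_matmul` are made up for this example):

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89],
   [0.55, 0.87, 0.66],
   [0.57, 0.85, 0.64],
   [0.22, 0.58, 0.33],
   [0.77, 0.25, 0.10],
   [0.05, 0.80, 0.55]]
)
query = inputs[1]

# One dot product per token, as in the loop version.
scores_loop = torch.stack([torch.dot(x_i, query) for x_i in inputs])

# Same computation as one (6, 3) @ (3,) product.
scores_matmul = inputs @ query

print(torch.allclose(scores_loop, scores_matmul))  # True
```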


### Normalizing

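Softmax normalizes the raw scores so that they are all positive and sum to one, which lets us read them as weights: how much each token contributes to the context vector. The formula is written out below for a later from-scratch implementation, but in general just use the PyTorch one, `torch.softmax`; it's more robust, since a naive version can overflow or underflow on very large or very small scores.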


$$
\vec{\alpha_2} =
\begin{bmatrix}
\varnothing \\
\varnothing \\
\varnothing \\
\varnothing \\
\varnothing \\
\varnothing
\end{bmatrix}
\rightarrow
\begin{bmatrix}
0.9544 \\
1.4950 \\
1.4754 \\
0.8434 \\
0.7070 \\
1.0865
\end{bmatrix}
\rightarrow
\text{softmax}
\rightarrow
\begin{bmatrix}
0.1385 \\
0.2379 \\
0.2333 \\
0.1240 \\
0.1082 \\
0.1581
\end{bmatrix}
$$



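where softmax is applied across the scores $\omega_{2j}$ of the row:

$$
\alpha_{2j} = \text{softmax}(\omega_{2})_j = \frac{e^{\omega_{2j}}}{\sum_{k=1}^{n} e^{\omega_{2k}}}
$$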


In [None]:


def softmax_naive(x):
  # x is a 1-D tensor of scores, so dim=0 sums over all of them.
  return torch.exp(x) / torch.exp(x).sum(dim=0)


attn_weights_2_solid = torch.softmax(attention_score_x2, dim=0)
print("Refined Attention Weights:", attn_weights_2_solid)


Refined Attention Weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
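
To see why the built-in version is the safer choice, here is a small sketch comparing the naive formula with `torch.softmax` on ordinary and on very large scores. `softmax_stable` is a hypothetical helper using the common subtract-the-max trick; it is meant as an illustration, not as a description of how PyTorch implements softmax internally.

```python
import torch

def softmax_naive(x):
  return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
  # Subtracting the max does not change the result mathematically,
  # but keeps every exponent <= 0, so exp() cannot overflow.
  shifted = x - x.max()
  return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)

small = torch.tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
large = torch.tensor([1000.0, 1001.0, 1002.0])

# On ordinary scores the naive version agrees with PyTorch.
print(torch.allclose(softmax_naive(small), torch.softmax(small, dim=0)))  # True

# exp(1000) overflows to inf in float32, so inf / inf gives nan.
print(softmax_naive(large))  # tensor([nan, nan, nan])

# The shifted version stays finite and matches PyTorch.
print(torch.allclose(softmax_stable(large), torch.softmax(large, dim=0)))  # True
```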


### Finalizing the Calculation of $z$

Recap of the attention weights for $x_2$:

$$
\vec{\alpha_2} =
\begin{bmatrix}
\varnothing \\
\varnothing \\
\varnothing \\
\varnothing \\
\varnothing \\
\varnothing
\end{bmatrix}
\rightarrow
\begin{bmatrix}
0.9544 \\
1.4950 \\
1.4754 \\
0.8434 \\
0.7070 \\
1.0865
\end{bmatrix}
\rightarrow
\text{softmax}
\rightarrow
\begin{bmatrix}
0.1385 \\
0.2379 \\
0.2333 \\
0.1240 \\
0.1082 \\
0.1581
\end{bmatrix}
$$


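The context vector is the attention-weighted sum of the value vectors. In this simplified version the values are the input embeddings themselves ($V_j = x_j$), so $z_2$ has one entry per embedding dimension (3 here), not one per token:

$$
\vec{z_{2}} =
\begin{bmatrix}
\varnothing \\
\varnothing \\
\varnothing
\end{bmatrix}
\rightarrow
\alpha_{21} V_1 + \alpha_{22} V_2 + \cdots + \alpha_{26} V_6
\rightarrow
\begin{bmatrix}
0.4419 \\
0.6515 \\
0.5683
\end{bmatrix} = \sum_{j=1}^{6} \alpha_{2j} V_j
$$



In [None]:
# Start from zeros with the same shape as an embedding.
# torch.empty returns uninitialized memory, which is unsafe to accumulate into.
context_vector = torch.zeros(query.shape)

# Weighted sum over the token embeddings (rows of inputs),
# not over the components of the query.
for index, x_j in enumerate(inputs):
  context_vector += attn_weights_2_solid[index] * x_j

print(context_vector)


tensor([0.4419, 0.6515, 0.5683])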
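
The weighted sum $z_2 = \sum_j \alpha_{2j} V_j$ can also be written as one vector-matrix product. A minimal self-contained sketch, assuming as in this simplified version that $V_j = x_j$ (the names `scores`, `weights`, and `z_2` are chosen just for this example):

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89],
   [0.55, 0.87, 0.66],
   [0.57, 0.85, 0.64],
   [0.22, 0.58, 0.33],
   [0.77, 0.25, 0.10],
   [0.05, 0.80, 0.55]]
)
query = inputs[1]

scores = inputs @ query                 # raw scores omega_2j, shape (6,)
weights = torch.softmax(scores, dim=0)  # attention weights alpha_2j, sum to 1
z_2 = weights @ inputs                  # sum_j alpha_2j * x_j, shape (3,)

print(z_2)  # tensor([0.4419, 0.6515, 0.5683])
```

The result has one entry per embedding dimension, because it is a blend of the token embeddings rather than a list of per-token scores.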


## Computing Attention Weights for all Input Tokens (3.3.2)