
#### Attention Mechanism (Adapted from “Attention Is All You Need”)

The attention mechanism, as described in *Attention Is All You Need* (Vaswani et al., 2017), operates on an input sequence stored as a matrix whose rows are token embeddings of dimension $D$. From this input we compute three sets of vectors: queries ($Q$), keys ($K$), and values ($V$). The core steps are:

1. **Linear Transformations:**  

Let $X$ be the $N \times D$ input matrix: each of its $N$ rows corresponds to one token in the input sequence and is a $D$-dimensional embedding vector.

To compute attention, we first project $X$ into queries, keys, and values using learned weight matrices:

$$
\begin{aligned}
Q_i &= X W_i^Q, \\
K_i &= X W_i^K, \\
V_i &= X W_i^V.
\end{aligned}
$$

Each head $i$ has its own learnable parameters $W_i^Q$, $W_i^K$, and $W_i^V$, which transform the input into queries, keys, and values, respectively.
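
The projection step can be sketched in NumPy as follows. This is a minimal illustration for a single head with randomly generated matrices; the names `X_demo`, `W_q_demo`, `W_k_demo`, and `W_v_demo` and the sizes are assumptions made for the example and are not the assignment data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (not the assignment's N and D).
n_tokens, d_model = 4, 6
X_demo = rng.normal(size=(n_tokens, d_model))  # one token embedding per row

# Hypothetical projection weights; in a trained model these are learned.
W_q_demo = rng.normal(size=(d_model, d_model))
W_k_demo = rng.normal(size=(d_model, d_model))
W_v_demo = rng.normal(size=(d_model, d_model))

# Each projection is a single matrix product:
# row n of Q_demo equals X_demo[n] @ W_q_demo, and likewise for K and V.
Q_demo = X_demo @ W_q_demo
K_demo = X_demo @ W_k_demo
V_demo = X_demo @ W_v_demo

print(Q_demo.shape, K_demo.shape, V_demo.shape)
```

Because the weights multiply from the right, every token (row) is projected independently with the same matrix.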


2. **Scaled Dot-Product Attention:**  

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.
$$

Here, $QK^\top$ produces a matrix of scores that measures how relevant each “query” position is to every “key” position, and $d_k$ is the dimension of the query and key vectors.

The softmax function converts these scores into attention weights (non-negative values that sum to 1 across each row).

These weights are then used to form a weighted combination of the rows of $V$, which gives the final output.
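
A minimal NumPy sketch of scaled dot-product attention is given below, assuming $Q$, $K$, and $V$ have already been computed. The helper names `softmax_rows` and `scaled_dot_product_attention` and the toy inputs are illustrative choices, not part of the assignment. Subtracting the row maximum before exponentiating is a common numerical-stability trick that leaves the softmax result unchanged.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]                # dimension of queries and keys
    scores = Q @ K.T / np.sqrt(d_k)  # entry (i, j): query i scored against key j
    weights = softmax_rows(scores)   # each row is non-negative and sums to 1
    return weights @ V, weights

# Toy inputs (illustrative values only).
Q_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
K_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
V_toy = np.array([[10.0, 0.0], [0.0, 10.0]])

out_toy, w_toy = scaled_dot_product_attention(Q_toy, K_toy, V_toy)
print(w_toy)
print(w_toy.sum(axis=1))  # row sums of the attention weights
```

In this toy case each query matches its own key best, so the largest weight in each row lies on the diagonal.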




3. **Multi-Head Attention:**  

$$
\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)
$$

$$
\text{where } \text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V).
$$


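Multiple attention heads allow the model to attend to different aspects of the input simultaneously. With $H$ heads, each head uses $d_k = d_{\text{model}}/H$, where $d_{\text{model}} = D$ is the embedding dimension. Their outputs are concatenated and then linearly transformed (by an output matrix $W^O$ in the original paper) to produce the final result.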
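
One possible way to organize the multi-head computation is sketched below. It rests on two assumptions that the text does not fix: head $i$ is taken to use a contiguous block of $d_k$ columns from full-size weight matrices, and the final linear transformation is omitted. The function names and random inputs are illustrative only.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, weights)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

def multi_head_sketch(X, W_q, W_k, W_v, num_heads):
    """Concatenate the outputs of num_heads independent attention heads."""
    d_model = W_q.shape[1]
    assert d_model % num_heads == 0, "num_heads must divide the model dimension"
    d_k = d_model // num_heads
    head_outputs, head_weights = [], []
    for i in range(num_heads):
        # ASSUMPTION: head i owns columns [i*d_k, (i+1)*d_k) of each weight matrix.
        cols = slice(i * d_k, (i + 1) * d_k)
        out_i, w_i = attention(X @ W_q[:, cols], X @ W_k[:, cols], X @ W_v[:, cols])
        head_outputs.append(out_i)
        head_weights.append(w_i)
    # Concatenating H blocks of width d_k restores the full width d_model.
    # No output projection is applied in this sketch.
    return np.concatenate(head_outputs, axis=-1), np.stack(head_weights)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 6))
W_q_demo, W_k_demo, W_v_demo = (rng.normal(size=(6, 6)) for _ in range(3))

out_demo, w_demo = multi_head_sketch(X_demo, W_q_demo, W_k_demo, W_v_demo, num_heads=3)
print(out_demo.shape, w_demo.shape)
```

With 3 heads and a model dimension of 6, each head works with $d_k = 2$ and produces its own matrix of attention weights, so the sketch returns one weight matrix per head rather than a single one.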



*Note:* In this assignment, you only need to carry out these matrix computations using the provided input `X` and the provided query, key, and value weight matrices (`Wq`, `Wk`, `Wv`).

## Self-attention Computer Assignment 


Implement the multi-head self-attention operation, which takes a set of $N$ vectors of dimension $D$ and outputs a matrix of the same size ($N \times D$). Do not rely on neural network libraries; instead, write the required operations directly in NumPy.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Data size
N = 5
D = 6

X = [[ 0.7, -0.8, -1.2,  -1.,  -0., -0.3],
     [ 2.7,  0.1,  1.6,  1.8,  1.5,  0.3],
     [ 0.1,  2.6, -0.1, -1.3, -0.5, -0.7],
     [ 1.1,  1.5,   1., -0.5,  0.4,  0.4],
     [-0.7, -0.7,  0.7, -1.5, -0.8,  1. ]]

Wq = [[-1.7,  1.6,  0.9, -0.5,  0.4,  -1.],
      [-0.4,  1. , -0.3,  1. ,  0.5,  1.1],
      [ 0.4, -0.9,  -1.,  0.5, -1.4,  0. ],
      [ 0.3,  1.4, -1.2,  0.2,  0.1,  1.6],
      [-0.8,  0.8, -0.7, -1.3,  0.3,  0.8],
      [ 1.1,  0.3, -1.5, -2.3,  2.2, -0.7]]

Wk = [[ 0.3, -0.4, -1.3,  0.3, -1.7,  1.1],
      [-2.3, -1.1,  0.6, -1.2,  2.2,  0.3],
      [ 1.1, -0.4, -0.5,  1.9, -1.1, -1.2],
      [-0.4,  1. , -1.7,  0. , -3.3, -1.4],
      [-0.9, -1.1, -1. ,  1.4,  1.3,  1.2],
      [-0.7,  0.4,  0.4, -1.4, -0.2, -0.5]]

Wv = [[-0.1,  0.7,  1. , -0.1,  1.6,  0.9],
      [ 0.4, -1. , -0.7, -0.6, -0.9, -0.1],
      [-0.4,  0.5, -1.4,  0.1,  0.6,  0.4],
      [ 1.4, -1.3, -1.3, -0.6,  1.6, -0.2],
      [-0.4, -0.6, -1.4, -1. ,  0.4, -0.8],
      [ 0.2,  0.5,  0.4, -0.5,  1.4,  2.3]]



X = np.array(X)
Wq = np.array(Wq)
Wk = np.array(Wk)
Wv = np.array(Wv)

### (a) Implement the self-attention operation

In [None]:
def self_attention(X, Wq, Wk, Wv):
    ...    
    return output, attention_weights
 

In [None]:
# Compute the output
output, attention_weights = self_attention(X, Wq, Wk, Wv)

# Print in a nice format
np.set_printoptions(precision=1)
print("Self-Attention Output:\n", output)
print("Self-Attention Matrix:\n", attention_weights)

### (b) Implement multi-head attention, using the previously implemented function

In [None]:
def multi_head_attention(X, Wq, Wk, Wv, H):
    ...
    return output, attention_weights


In [None]:
# Compute multi-head attention
H = 3
attention_output, attention_weights = multi_head_attention(X, Wq, Wk, Wv, H)
# Again print the requested results

### (c+d) Provide the answers/explanations requested in the problem sheet:
1. Why are the results different?
2. What happens if you change the order of two inputs?