<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Components/attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Components of Neural Networks

## Self-Attention

Self-attention is a mechanism that allows each position in a sequence to attend to every position in the same sequence, including itself. It starts with an input sequence $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of tokens and $d$ is the embedding dimension. The matrices $Q$, $K$, and $V$ are computed as follows:

$$Q = XW_Q \in \mathbb{R}^{n \times k}$$
$$K = XW_K \in \mathbb{R}^{n \times k}$$
$$V = XW_V \in \mathbb{R}^{n \times v}$$

where $W_Q \in \mathbb{R}^{d \times k}$, $W_K \in \mathbb{R}^{d \times k}$, and $W_V \in \mathbb{R}^{d \times v}$ are learned weight matrices. 

The attention mechanism is defined mathematically as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{k}}\right)V$$

Here the product $QK^T \in \mathbb{R}^{n \times n}$ gives the raw attention scores. These are divided by $\sqrt{k}$, passed through a row-wise softmax to obtain the attention weights, and finally multiplied by $V$ to give the output in $\mathbb{R}^{n \times v}$.

```mermaid
graph TB
    X[Input Sequence X] -->|WQ| Q[Q Matrix]
    X -->|WK| K[K Matrix]
    X -->|WV| V[V Matrix]
    Q & K --> |QK^T| S[Scaled Attention]
    S --> |softmax| W[Attention Weights]
    W & V --> O[Output]
```
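The pipeline in the diagram can be sketched in a few lines of NumPy. This is a minimal single-head version with no masking or batching; the function names `softmax_rows` and `self_attention`, the random inputs, and the sizes are assumptions made for illustration only.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    Q = X @ W_Q                      # (n, k)
    K = X @ W_K                      # (n, k)
    V = X @ W_V                      # (n, v)
    k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(k)    # (n, n) scaled scores
    weights = softmax_rows(scores)   # each row sums to 1
    return weights @ V, weights      # (n, v), (n, n)

# Hypothetical sizes and random data, chosen only to show the shapes
rng = np.random.default_rng(0)
n, d, k, v = 5, 8, 4, 6
X_demo = rng.normal(size=(n, d))
out, weights = self_attention(
    X_demo,
    rng.normal(size=(d, k)),
    rng.normal(size=(d, k)),
    rng.normal(size=(d, v)),
)
print(out.shape, weights.shape)  # (5, 6) (5, 5)
```

The output has one row per input token and $v$ columns, while the weight matrix is always $n \times n$.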

### Self-Attention with Example

Consider an input sequence $X$ represented as a $3 \times 4$ matrix:

$$X = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 4}$$

Assume the weight matrices $W_Q$, $W_K$, and $W_V$ are given as:

$$W_Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 2}, \quad W_K = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \in \mathbb{R}^{4 \times 2}, \quad W_V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 2}$$

Then, the matrices $Q$, $K$, and $V$ are computed as:

$$Q = XW_Q = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 2 & 2 \end{bmatrix} \in \mathbb{R}^{3 \times 2}, \quad K = XW_K = \begin{bmatrix} 0 & 2 \\ 2 & 0 \\ 2 & 2 \end{bmatrix} \in \mathbb{R}^{3 \times 2}, \quad V = XW_V = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 2 & 2 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$$
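The three projections can be reproduced with NumPy as a quick check. The arrays below are copied from the example; only the variable names are chosen here.

```python
import numpy as np

# Input sequence and weight matrices from the example
X = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 1]])
W_Q = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
W_K = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])
W_V = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

# Each projection maps the 3 x 4 input to a 3 x 2 matrix
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print(Q)
print(K)
print(V)
```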

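The attention output is then computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors (written $k$ above; in this case, 2).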

### Computation of Attention Scores

Using the matrices $Q$, $K$, and $V$ computed above, we can calculate the attention scores as follows:

$$QK^T = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} 0 & 2 \\ 2 & 0 \\ 2 & 2 \end{bmatrix}^T = \begin{bmatrix} 0 & 4 & 4 \\ 4 & 0 & 4 \\ 4 & 4 & 8 \end{bmatrix}$$

Next, we scale the scores by $\sqrt{d_k}$ (which is $\sqrt{2}$):

$$\frac{QK^T}{\sqrt{d_k}} = \begin{bmatrix} 0 & 4 & 4 \\ 4 & 0 & 4 \\ 4 & 4 & 8 \end{bmatrix} \div \sqrt{2} = \begin{bmatrix} 0 & 2.83 & 2.83 \\ 2.83 & 0 & 2.83 \\ 2.83 & 2.83 & 5.66 \end{bmatrix}$$

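Applying the softmax function to each row gives the attention weights. In the first row, for example, $e^{0} / (e^{0} + 2e^{2.83}) \approx 0.029$ and $e^{2.83} / (e^{0} + 2e^{2.83}) \approx 0.486$:

$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \approx \begin{bmatrix} 0.029 & 0.486 & 0.486 \\ 0.486 & 0.029 & 0.486 \\ 0.053 & 0.053 & 0.894 \end{bmatrix}$$

Multiplying the weights by $V$ gives the output (entries computed from the unrounded weights):

$$\text{Attention}(Q, K, V) = AV \approx \begin{bmatrix} 0.029 & 0.486 & 0.486 \\ 0.486 & 0.029 & 0.486 \\ 0.053 & 0.053 & 0.894 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 2 & 2 \end{bmatrix} \approx \begin{bmatrix} 1.029 & 1.943 \\ 1.943 & 1.029 \\ 1.894 & 1.894 \end{bmatrix}$$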
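The hand calculation can be checked step by step with NumPy. This sketch starts from the $Q$, $K$, and $V$ matrices given earlier and prints each intermediate result so it can be compared with the matrices in the text; the variable names are chosen here for illustration.

```python
import numpy as np

# Q, K, V from the worked example
Q = np.array([[2, 0], [0, 2], [2, 2]], dtype=float)
K = np.array([[0, 2], [2, 0], [2, 2]], dtype=float)
V = np.array([[2, 0], [0, 2], [2, 2]], dtype=float)

d_k = K.shape[1]                      # key dimension, 2 here
scores = Q @ K.T                      # raw scores, shape (3, 3)
scaled = scores / np.sqrt(d_k)        # scaled scores

# Row-wise softmax (row max subtracted for numerical stability)
exp = np.exp(scaled - scaled.max(axis=1, keepdims=True))
weights = exp / exp.sum(axis=1, keepdims=True)

output = weights @ V                  # attention output, shape (3, 2)

np.set_printoptions(precision=3, suppress=True)
print(scores)
print(scaled)
print(weights)
print(output)
```

Because the exponential grows quickly, a score gap of about 2.83 already pushes most of the weight onto the larger entries of each row.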

```python
import numpy as np
import plotly.graph_objects as go

# Define matrices from the self-attention example
X = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]])
V = np.array([[2, 0], [0, 2], [2, 2]])
# Attention output softmax(QK^T / sqrt(d_k)) V, rounded to three decimals
Attention = np.array([[1.029, 1.943], [1.943, 1.029], [1.894, 1.894]])

# Create 3D scatter plot
fig = go.Figure()

# Plot X (only the first three of its four embedding dimensions fit on the axes)
fig.add_trace(go.Scatter3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], mode='markers+text', name='X', text=['X1', 'X2', 'X3'], textposition='top center'))

# Plot V (two-dimensional, placed on the z = 0 plane)
fig.add_trace(go.Scatter3d(x=V[:, 0], y=V[:, 1], z=[0, 0, 0], mode='markers+text', name='V', text=['V1', 'V2', 'V3'], textposition='top center'))

# Plot Attention (two-dimensional, placed on the z = 0 plane)
fig.add_trace(go.Scatter3d(x=Attention[:, 0], y=Attention[:, 1], z=[0, 0, 0], mode='markers+text', name='Attention', text=['A1', 'A2', 'A3'], textposition='top center'))

# Update layout
fig.update_layout(scene=dict(xaxis_title='X', yaxis_title='Y', zaxis_title='Z'), title='3D Plot of X, V, and Attention')

fig.show()
```

## Cross-Attention

### Cross-Attention Mechanism

Cross-attention computes attention between two different sequences. Given a query sequence $X_q \in \mathbb{R}^{m \times d}$ and a key-value sequence $X_{kv} \in \mathbb{R}^{n \times d}$ (where $m$ and $n$ may be different), the matrices are computed as:

$$Q = X_qW_Q \in \mathbb{R}^{m \times k}$$
$$K = X_{kv}W_K \in \mathbb{R}^{n \times k}$$
$$V = X_{kv}W_V \in \mathbb{R}^{n \times v}$$

where $W_Q \in \mathbb{R}^{d \times k}$, $W_K \in \mathbb{R}^{d \times k}$, and $W_V \in \mathbb{R}^{d \times v}$ are weight matrices.

The cross-attention output is computed as:

$$\text{CrossAttention}(Q, K, V) = \text{softmax}_{\text{row}}\left(\frac{QK^T}{\sqrt{k}}\right)V \in \mathbb{R}^{m \times v}$$

Note that the attention weight matrix is of size $m \times n$, reflecting that each position in the query sequence attends to all positions in the key-value sequence.

```mermaid
graph TB
    Xq[Query Sequence] -->|WQ| Q[Q Matrix]
    Xkv[Key-Value Sequence] -->|WK| K[K Matrix]
    Xkv -->|WV| V[V Matrix]
    Q & K --> |QK^T| S[Scaled Attention]
    S --> |softmax| W[Attention Weights]
    W & V --> O[Output]
```
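A minimal NumPy sketch of the mechanism follows. The function name `cross_attention`, the random inputs, and the sizes are assumptions for illustration; it covers a single head with no masking or batching.

```python
import numpy as np

def cross_attention(X_q, X_kv, W_Q, W_K, W_V):
    """Single-head scaled dot-product cross-attention (illustrative sketch)."""
    Q = X_q @ W_Q                     # (m, k) queries come from X_q
    K = X_kv @ W_K                    # (n, k) keys come from X_kv
    V = X_kv @ W_V                    # (n, v) values come from X_kv
    scaled = Q @ K.T / np.sqrt(Q.shape[-1])          # (m, n)
    # Row-wise softmax with the row max subtracted for stability
    exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)  # (m, n)
    return weights @ V, weights       # (m, v), (m, n)

# Hypothetical sizes and random data, chosen only to show the shapes
rng = np.random.default_rng(0)
m, n, d, k, v = 2, 7, 8, 4, 6
out, A = cross_attention(
    rng.normal(size=(m, d)),
    rng.normal(size=(n, d)),
    rng.normal(size=(d, k)),
    rng.normal(size=(d, k)),
    rng.normal(size=(d, v)),
)
print(out.shape, A.shape)  # (2, 6) (2, 7)
```

The only structural change from self-attention is that the keys and values are projected from a different sequence than the queries.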

### Cross-Attention Example

Consider a query sequence $X_q$ and a key-value sequence $X_{kv}$:

$$X_q = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 4}$$

$$X_{kv} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 4}$$

With weight matrices:

$$W_Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 2}, \quad W_K = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 2}, \quad W_V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 2}$$

The computation proceeds as follows:

1. Computing Q, K, and V:
$$Q = X_qW_Q = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \in \mathbb{R}^{2 \times 2}$$
$$K = X_{kv}W_K = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 2 & 2 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$$
$$V = X_{kv}W_V = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 2 & 2 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$$

2. Computing attention scores:
$$QK^T = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 2 & 2 \end{bmatrix}^T = \begin{bmatrix} 4 & 0 & 4 \\ 0 & 4 & 4 \end{bmatrix} \in \mathbb{R}^{2 \times 3}$$

3. Scaling by $\sqrt{k}$ (where $k = 2$):
$$\frac{QK^T}{\sqrt{k}} = \frac{1}{\sqrt{2}} \begin{bmatrix} 4 & 0 & 4 \\ 0 & 4 & 4 \end{bmatrix} = \begin{bmatrix} 2.83 & 0 & 2.83 \\ 0 & 2.83 & 2.83 \end{bmatrix} \in \mathbb{R}^{2 \times 3}$$

4. Applying softmax to obtain the attention weight matrix:
$$A = \text{softmax}_{\text{row}}\left(\frac{QK^T}{\sqrt{k}}\right) \approx \begin{bmatrix} 0.486 & 0.029 & 0.486 \\ 0.029 & 0.486 & 0.486 \end{bmatrix} \in \mathbb{R}^{2 \times 3}$$

5. Computing the cross-attention output (entries computed from the unrounded weights):
$$\text{Cross-Attention Output} = AV \approx \begin{bmatrix} 0.486 & 0.029 & 0.486 \\ 0.029 & 0.486 & 0.486 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 2 & 2 \end{bmatrix} \approx \begin{bmatrix} 1.943 & 1.029 \\ 1.029 & 1.943 \end{bmatrix} \in \mathbb{R}^{2 \times 2}$$
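The five steps can be replayed in NumPy as a check on the arithmetic. The arrays are copied from the example; the shared name `W` is a shorthand introduced here because $W_Q$, $W_K$, and $W_V$ are identical in this example.

```python
import numpy as np

# Query and key-value sequences from the example
X_q = np.array([[1, 0, 1, 0], [0, 1, 0, 1]], dtype=float)
X_kv = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]], dtype=float)
W = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)  # W_Q = W_K = W_V here

# Step 1: projections
Q = X_q @ W        # (2, 2)
K = X_kv @ W       # (3, 2)
V = X_kv @ W       # (3, 2)

# Steps 2 and 3: raw and scaled scores
scores = Q @ K.T                       # (2, 3)
scaled = scores / np.sqrt(K.shape[1])

# Step 4: row-wise softmax
exp = np.exp(scaled - scaled.max(axis=1, keepdims=True))
A = exp / exp.sum(axis=1, keepdims=True)

# Step 5: weighted sum of the values
output = A @ V                         # (2, 2)

np.set_printoptions(precision=3, suppress=True)
print(scores)
print(A)
print(output)
```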

```python
import numpy as np
import plotly.graph_objects as go

# Define matrices for cross-attention
X_q = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
X_kv = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]])
V_cross = np.array([[2, 0], [0, 2], [2, 2]])
# Cross-attention output softmax(QK^T / sqrt(k)) V, rounded to three decimals
Attention_cross = np.array([[1.943, 1.029], [1.029, 1.943]])

# Create 3D scatter plot for cross-attention
fig = go.Figure()

# Plot X_q (first two embedding dimensions, placed on the z = 2 plane)
fig.add_trace(go.Scatter3d(x=X_q[:, 0], y=X_q[:, 1], z=[2, 2], mode='markers+text', name='X_q', text=['Xq1', 'Xq2'], textposition='top center'))

# Plot X_kv (first two embedding dimensions, placed on the z = 0 plane)
fig.add_trace(go.Scatter3d(x=X_kv[:, 0], y=X_kv[:, 1], z=[0, 0, 0], mode='markers+text', name='X_kv', text=['Xkv1', 'Xkv2', 'Xkv3'], textposition='top center'))

# Plot Cross-Attention output (placed on the z = 1 plane)
fig.add_trace(go.Scatter3d(x=Attention_cross[:, 0], y=Attention_cross[:, 1], z=[1, 1], mode='markers+text', name='Cross-Attention', text=['CA1', 'CA2'], textposition='top center'))

# Update layout
fig.update_layout(scene=dict(xaxis_title='X', yaxis_title='Y', zaxis_title='Z'), title='3D Plot of Cross-Attention Components')

fig.show()
```

### Key Differences from Self-Attention

1. **Different Input Sequences**: Cross-attention uses two distinct sequences ($X_q$ and $X_{kv}$), while self-attention uses the same sequence for both.

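2. **Output Dimensions**: The output has one row per query position, so its length matches the query sequence ($m = 2$ in the example) rather than the key-value sequence ($n = 3$). In self-attention the two lengths coincide, so the output has as many rows as the single input sequence.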

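3. **Attention Pattern**: The attention weights show how each position in $X_q$ attends to positions in $X_{kv}$. The weight matrix is $m \times n$ and in general not square, so the relationship is asymmetric: queries read from the key-value sequence, not the reverse.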

This mechanism is particularly useful in sequence-to-sequence tasks where different-length sequences need to interact, such as in machine translation or question-answering systems.
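The differences listed above can be seen directly in the shapes. In the sketch below, a single hypothetical `attention` function takes separate query and key-value inputs; passing the same array for both reduces it to self-attention. The inputs reuse the example sequences, and one shared weight matrix `W` is assumed for all three projections.

```python
import numpy as np

def attention(X_q, X_kv, W_Q, W_K, W_V):
    """Scaled dot-product attention; X_q and X_kv may be the same array."""
    Q, K, V = X_q @ W_Q, X_kv @ W_K, X_kv @ W_V
    scaled = Q @ K.T / np.sqrt(K.shape[1])
    exp = np.exp(scaled - scaled.max(axis=1, keepdims=True))
    weights = exp / exp.sum(axis=1, keepdims=True)
    return weights @ V, weights

X_q = np.array([[1, 0, 1, 0], [0, 1, 0, 1]], dtype=float)
X_kv = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]], dtype=float)
W = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)  # shared weights (assumption)

# Cross-attention: queries from X_q, keys and values from X_kv
cross_out, cross_w = attention(X_q, X_kv, W, W, W)
# Self-attention: the same sequence plays both roles
self_out, self_w = attention(X_kv, X_kv, W, W, W)

print(cross_out.shape, cross_w.shape)  # (2, 2) (2, 3)
print(self_out.shape, self_w.shape)    # (3, 2) (3, 3)
```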