# Mathematical Foundations of Transformer Attention
This notebook is a focused, incremental build-up of the math behind Transformer self-attention. Each concept is introduced *only* in the context of why it matters for attention. Each section pairs a concise explanation with a small code probe so the abstractions stay grounded.

## 1) Core linear algebra primitives (why they matter for attention)
- **Vectors**: token embeddings; geometry encodes semantic relationships.
  - Similar direction ⇒ similar meaning; length can encode confidence or frequency.
- **Dot product**: fast similarity proxy.  $x\cdot y = \|x\|\,\|y\|\cos\theta$.
  - Used to score how much a query should look at a key.
- **Cosine similarity**: normalized dot product removes magnitude: $\cos\theta = \frac{x\cdot y}{\|x\|\|y\|}$.
- **Matrices**: batch many vectors so we can compare all queries with all keys efficiently.
  - $QK^\top$ forms an $n\times n$ grid of raw attention logits.
- **Softmax**: turns arbitrary real scores into a probability distribution per query row.
  - We subtract the row max for numerical stability; this shift leaves the softmax output unchanged.
  - Masking sets invalid positions to $-\infty$ so softmax gives them probability 0.
- **Scaling**: divide by $\sqrt{d_k}$ so logits stay in a trainable range as dimensionality grows.
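
The effect of the $\sqrt{d_k}$ scaling can be checked empirically. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealization, not something trained models guarantee); under that assumption the raw dot product has standard deviation about $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to about 1. The helper `logit_std` and the sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

def logit_std(d_k, n_samples=2000, scale=False):
    """Empirical std of q.k for random q, k with unit-variance components (assumption)."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    logits = np.sum(q * k, axis=1)      # one dot product per sample
    if scale:
        logits = logits / np.sqrt(d_k)  # the attention scaling
    return logits.std()

for d_k in (4, 64, 256):
    print(f"d_k={d_k:4d}  raw std={logit_std(d_k):6.2f}  "
          f"scaled std={logit_std(d_k, scale=True):5.2f}")
```

The raw standard deviation should grow roughly like $\sqrt{d_k}$ while the scaled one stays near 1, which is what keeps the softmax from saturating as the head dimension grows.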

In [None]:
# Vector and dot product demo
import numpy as np
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
dot = np.dot(x, y)
norm_x = np.linalg.norm(x)
norm_y = np.linalg.norm(y)
cos_sim = dot / (norm_x * norm_y)
print(f'Dot product: {dot}')
print(f'Cosine similarity: {cos_sim:.3f}')

**Explanation:**
- The dot product measures how aligned two vectors are.
- Cosine similarity normalizes by length, so only direction matters.
- In Transformers, the dot product between a query and a key (after scaling) is the raw attention score for that pair of tokens.
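
To make the difference between the two measures concrete, the sketch below stretches one vector without changing its direction. The helper name `cosine` and the factor of 10 are illustrative choices.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

y_long = 10.0 * y  # same direction as y, ten times the length

print("dot(x, y)      =", np.dot(x, y))
print("dot(x, y_long) =", np.dot(x, y_long))   # grows with the length of y
print("cos(x, y)      =", round(cosine(x, y), 6))
print("cos(x, y_long) =", round(cosine(x, y_long), 6))  # unchanged
```

The dot product scales with vector length while the cosine does not, so a score based on raw dot products is sensitive to magnitude as well as direction.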

In [None]:
# Softmax normalization demo
def softmax(x, temperature=1.0):
    x = x / temperature
    x = x - np.max(x)  # for numerical stability
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x)
scores = np.array([2.0, 1.0, 0.1])
print('Softmax:', softmax(scores))
print('Softmax (high temp):', softmax(scores, temperature=2.0))

**Explanation:**
- Softmax converts scores to probabilities.
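- Subtracting the max score prevents overflow in `np.exp` and does not change the result, because softmax depends only on differences between scores.
- Temperature controls how peaked or flat the distribution is: a higher temperature flattens it toward uniform, a lower temperature concentrates it on the largest score.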
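
The overflow that max-subtraction avoids is easy to reproduce. In this sketch, `softmax_naive` and `softmax_stable` are illustrative names, and the scores near 1000 are chosen only because `np.exp` overflows for float64 inputs above roughly 709.

```python
import numpy as np

def softmax_naive(x):
    # No max-subtraction: np.exp overflows to inf for large inputs
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x)

def softmax_stable(x):
    # Shifting by the max keeps every exponent <= 0, so exp() stays in (0, 1]
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

big = np.array([1000.0, 999.0, 998.0])
small = big - 1000.0  # same differences between scores, smaller magnitudes

# Silence the overflow/invalid warnings so the nan result is easy to read
with np.errstate(over="ignore", invalid="ignore"):
    print("naive :", softmax_naive(big))  # inf / inf gives nan
print("stable:", softmax_stable(big))
print("shift-invariant:", np.allclose(softmax_stable(big), softmax_stable(small)))
```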

## 2) The self-attention pipeline (compact view)
1. Linear projections: input embeddings $X \in \mathbb{R}^{n\times d_{model}}$ map to Queries $Q$, Keys $K$, Values $V$.
2. Similarity scores: $S = QK^\top / \sqrt{d_k}$ (the scaling keeps the variance of the scores roughly independent of $d_k$).
3. Mask (causal/padding): set disallowed logits to $-\infty$.
4. Normalize: $A = \mathrm{softmax}(S)$ (row-wise).
5. Aggregate: $Z = A V$.
6. (Multi-head): split the dimensions across heads, run steps 1-5 independently per head, then concatenate the head outputs and mix them with $W^O$.

Shapes (single head):
- $Q,K,V: n\times d_k$ (or $d_v$ for $V$).
- $S: n\times n$
- $A: n\times n$ (rows sum to 1).
- $Z: n\times d_v$, projected back to $n\times d_{model}$ by $W^O$.
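
Steps 1 to 5 and the shapes listed above can be sketched in a few lines, here with a causal mask so that each token attends only to itself and earlier tokens. The function name `causal_self_attention`, the toy sizes, and the random weights are assumptions for illustration; real projection matrices are learned.

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention with a causal mask (a minimal sketch)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                 # 1. projections
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                          # 2. scaled scores, (n, n)
    n = S.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    S = np.where(future, -np.inf, S)                    # 3. mask future positions
    S = S - S.max(axis=1, keepdims=True)                # stability shift per row
    A = np.exp(S)                                       # exp(-inf) = 0
    A = A / A.sum(axis=1, keepdims=True)                # 4. row-wise softmax
    return A, A @ V                                     # 5. aggregate

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 6, 3, 5   # toy sizes, chosen arbitrarily
X = rng.standard_normal((n, d_model))
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

A, Z = causal_self_attention(X, W_Q, W_K, W_V)
print(np.round(A, 3))     # entries above the diagonal are exactly 0
print(A.shape, Z.shape)   # (4, 4) (4, 5)
```

Because the diagonal is never masked, every row keeps at least one finite score, so the row maximum is finite and the softmax is well defined.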

# Atomic Math Concepts for Transformer Attention

Let's build up from the most basic building blocks, with each concept explained and demonstrated in code.

## 1. Scalars
A scalar is a single number. Scalars are used for scaling, shifting, and as parameters in formulas.

**Example:** $a = 5$

**Explanation:**
- Scalars are the simplest objects in math. In deep learning, they often represent weights, biases, or single values like temperature in softmax.

In [None]:
# Scalar example
scalar = 5
print('Scalar:', scalar)

## 2. Vectors
A vector is an ordered list of numbers. Vectors represent points or directions in space.

**Example:** $\mathbf{x} = [1, 2, 3]$

**Explanation:**
- Vectors are used to represent tokens in NLP models. Each token is mapped to a vector in a high-dimensional space.

In [None]:
# Vector example
import numpy as np
vector = np.array([1, 2, 3])
print('Vector:', vector)

## 3. Dot Product
The dot product measures how much two vectors point in the same direction.

**Formula:** $\mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i$

**Example:** $[1, 2, 3] \cdot [4, 5, 6] = 1\times4 + 2\times5 + 3\times6 = 32$

**Explanation:**
- The dot product is large when vectors are aligned and long. In Transformers, it scores how strongly a query matches a key.

In [None]:
# Dot product example
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
dot = np.dot(x, y)
print('Dot product:', dot)

## 4. Norm (Length) of a Vector
The norm is the length of a vector.

**Formula:** $\|\mathbf{x}\| = \sqrt{\sum_i x_i^2}$

**Example:** $\|[1, 2, 3]\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14}$

**Explanation:**
- Norms are used to normalize vectors, which is important for cosine similarity and stable training.

In [None]:
# Norm example
norm_x = np.linalg.norm(x)
print('Norm of x:', norm_x)
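
Dividing a vector by its norm gives a unit vector with the same direction. The short sketch below (variable names `x_unit` and `y_unit` are illustrative) checks that the result has length 1 and that the dot product of two unit vectors equals the normalized dot product of the originals.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

x_unit = x / np.linalg.norm(x)   # same direction as x, length 1
y_unit = y / np.linalg.norm(y)

print("unit vector:", np.round(x_unit, 3))
print("its norm   :", np.linalg.norm(x_unit))

# After normalization, the plain dot product no longer depends on length
normalized_dot = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print("dot of unit vectors:", np.dot(x_unit, y_unit))
print("normalized dot     :", normalized_dot)
```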

## 5. Cosine Similarity
Cosine similarity is the cosine of the angle between two vectors, so it depends only on their directions and not on their lengths.

**Formula:** $\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}$

**Explanation:**
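- Cosine similarity compares directions independent of magnitude. Standard Transformer attention uses the scaled dot product instead, so vector lengths do influence the scores; cosine similarity is the magnitude-free version of the same comparison.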

In [None]:
# Cosine similarity example
norm_y = np.linalg.norm(y)
cos_sim = dot / (norm_x * norm_y)
print('Cosine similarity:', cos_sim)

## 6. Matrices
A matrix is a grid of numbers. Matrices can represent collections of vectors or transformations.

**Example:** $A = \begin{bmatrix}1 & 2 \\ 3 & 4\end{bmatrix}$

**Explanation:**
- In Transformers, matrices bundle token vectors and perform batch operations efficiently.

In [None]:
# Matrix example
A = np.array([[1, 2], [3, 4]])
print('Matrix A:\n', A)

## 7. Matrix Multiplication
Matrix multiplication combines two matrices to produce a new matrix.

**Formula:** $(AB)_{ij} = \sum_k A_{ik} B_{kj}$

**Explanation:**
- In attention, a single matrix product $QK^\top$ computes all pairwise similarities between queries and keys.

In [None]:
# Matrix multiplication example
B = np.array([[2, 0], [1, 2]])
product = A @ B  # same result as np.dot(A, B) for 2-D arrays
print('Matrix product AB:\n', product)
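
To connect the formula to attention, the sketch below checks that entry $(i, j)$ of $QK^\top$ is the dot product of query $i$ with key $j$. The matrices here are random stand-ins with arbitrary toy sizes, not outputs of a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 hypothetical queries, dimension 4
K = rng.standard_normal((3, 4))   # 3 hypothetical keys, dimension 4

S = Q @ K.T                       # all 3 x 3 scores in one matrix product

# The same scores computed one dot product at a time
S_loop = np.empty((3, 3))
for i in range(3):
    for j in range(3):
        S_loop[i, j] = np.dot(Q[i], K[j])

print("shapes:", S.shape, S_loop.shape)
print("identical up to rounding:", np.allclose(S, S_loop))
```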

## 8. Softmax Function
Softmax converts a list of scores into probabilities.

**Formula:** $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$

**Explanation:**
- Softmax is used in attention to turn similarity scores into weights for mixing information.

In [None]:
# Softmax example
def softmax(x):
    x = x - np.max(x)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x)
scores = np.array([2.0, 1.0, 0.1])
print('Softmax:', softmax(scores))

## 9. From primitives to a minimal attention head
We now have dot products (similarity), scaling (stability), softmax (distribution), and matrix multiplication (batching), which is everything needed to assemble a minimal self-attention head for a toy sequence.

In [None]:
# Minimal self-attention head (single head, no masking)
import numpy as np
np.random.seed(0)

# Toy sequence: 4 tokens, model dim 6
n = 4
d_model = 6
d_k = d_v = 6
X = np.random.randn(n, d_model)

# Learned projection matrices (random init for demo)
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_v)

Q = X @ W_Q  # shape (n, d_k)
K = X @ W_K  # shape (n, d_k)
V = X @ W_V  # shape (n, d_v)

# Similarity scores
S = Q @ K.T / np.sqrt(d_k)  # (n, n)

# Softmax row-wise
S_shift = S - S.max(axis=1, keepdims=True)
A = np.exp(S_shift)
A /= A.sum(axis=1, keepdims=True)

Z = A @ V  # (n, d_v)
print('Attention weights (rows sum to 1):')
print(np.round(A, 3))
print('\nOutput representations Z:')
print(np.round(Z, 3))
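
The multi-head step from the pipeline overview (split, attend per head, concatenate, mix with $W^O$) can be sketched in the same style. This is one possible layout, assuming `d_model` is divisible by the number of heads and that each head takes a contiguous slice of the projected dimensions; the names `multi_head_attention`, `softmax_rows`, and `n_heads` are illustrative, and the weights are random rather than learned.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax over the last axis, with the usual max shift."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads   # assumes d_model is divisible by n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets a column slice
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    A = softmax_rows(S)                              # one distribution per head and query
    Z = A @ V                                        # (n_heads, n, d_head)
    Z = Z.transpose(1, 0, 2).reshape(n, d_model)     # concatenate heads
    return Z @ W_O                                   # mix heads, (n, d_model)

rng = np.random.default_rng(0)
n, d_model, n_heads = 4, 6, 2
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
print(out.shape)   # (4, 6)
```

Each head scales by $\sqrt{d_{head}}$ rather than $\sqrt{d_{model}}$, because the dot products inside a head only sum over that head's slice of dimensions.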