## I. The Motivation: From Static to Contextual Embeddings

The foundational requirement for any Natural Language Processing (NLP) application is the ability to represent **words as numbers**.

### 1. Limitations of Static Embeddings
While early techniques like One-Hot Encoding and Bag of Words exist, **Word Embeddings** are highly valued because they successfully convert a word's **semantic meaning** into a numerical vector representation.

However, traditional word embeddings suffer from a critical flaw: they are **static**.

*   **Average Meaning:** The word embedding captures the *average* meaning of a word across its entire training dataset, rather than its specific usage in the current sentence.
*   **Context Insensitivity:** The same static vector is used for a word regardless of its surrounding context. For example, the word "bank" in "money bank" and "river bank" would use the **exact same numerical embedding**, even though the meanings are completely different (financial institution vs. river shore). A single vector that blends both senses is a poor starting point for NLP tasks that depend on the intended meaning.
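The context insensitivity described above can be seen in a minimal sketch. The vocabulary, the 4-dimensional random vectors, and the `embed` helper are all hypothetical stand-ins for a trained embedding table, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary with a randomly initialised static embedding table.
vocab = {"money": 0, "bank": 1, "grows": 2, "river": 3, "flows": 4}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))  # one 4-dim vector per word

def embed(sentence):
    """Look up the static vector of each word; the context is never consulted."""
    return np.stack([embedding_table[vocab[word]] for word in sentence.split()])

financial = embed("money bank grows")
river = embed("river bank flows")

# "bank" sits at position 1 in both sentences and receives the identical vector.
same_vector = np.array_equal(financial[1], river[1])
print(same_vector)
```

Because the lookup depends only on the word's index, no amount of surrounding text can change the vector that "bank" receives.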

### 2. The Solution: Dynamic Contextual Embeddings
To solve this, a dynamic system is required that produces **Contextual Embeddings**. These embeddings must change based on the specific context of the sentence.

*   **Self-Attention's Role:** **Self-Attention** is the mechanism designed to convert these static word embeddings into dynamic, contextual embeddings. The contextual embedding of a word (e.g., "bank") should depend on and incorporate information from the other words in the sentence (e.g., "money" and "grows" or "river" and "flows").

## II. The Simple Self-Attention Model (First Principles)

The simplest approach to creating a contextual embedding is to calculate it as a **weighted sum** of all the static word embeddings in the sentence.

<img src="./images/sa1.png">
<img src="./images/sa2.png">

### 1. The Core Idea: Weighted Sum and Similarity
The new, contextual embedding of a word is represented as a combination of its own static embedding and the static embeddings of all other words in the sentence.

*   **Calculation:** The new embedding (which is a vector) is the result of a **weighted sum** of the original embeddings, where all components are $N$-dimensional vectors.
*   **Weights as Similarity:** The coefficients (weights) used in this sum must represent the **similarity** between the target word's embedding and the other words' embeddings.

### 2. Mathematical Steps of the Simple Model

The simple model to calculate a new contextual embedding ($Y_{new}$) involves three steps:

1.  **Calculate Raw Similarity Scores ($S_{i,j}$):** Similarity between high-dimensional vectors can be calculated using the **Dot Product**. The dot product between two word embeddings (e.g., $E_{bank}$ and $E_{money}$) yields a scalar score ($S$).
2.  **Normalize Scores (Softmax):** These raw scores ($S_{i,j}$) are then passed through the **Softmax function**. Softmax normalizes the scores to be positive and ensures they sum to one. This converts them into weights ($W_{i,j}$) that can be interpreted as probabilities or alignment scores (e.g., "bank" is 70% related to itself, 20% to "money," and 10% to "grows").
3.  **Generate New Embedding ($Y_{new}$):** The final contextual embedding is generated by using these normalized weights ($W$) to calculate the **weighted sum** of the original static embeddings.
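The three steps above can be sketched in a few lines of NumPy. The embedding values are made up for illustration, and the function name `simple_self_attention` is an assumption, not a library API.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def simple_self_attention(E):
    """Parameter-free self-attention over static embeddings E of shape (n_words, dim)."""
    S = E @ E.T              # step 1: raw similarity scores, shape (n_words, n_words)
    W = softmax(S, axis=-1)  # step 2: each row becomes positive weights summing to 1
    Y = W @ E                # step 3: weighted sum of the static embeddings
    return Y, W

# Made-up 3-dim static embeddings for "money", "bank", "grows".
E = np.array([[1.0, 0.2, 0.0],
              [0.8, 0.5, 0.1],
              [0.1, 0.9, 0.3]])

Y_new, W = simple_self_attention(E)
print(W[1])      # how strongly "bank" attends to each word, itself included
print(Y_new[1])  # the contextual embedding of "bank"
```

Row `i` of `W` holds the weights for word `i`, so `Y_new[i]` is that word's contextual embedding.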

## III. Flaws and Refinement: The Need for Learning

The simple approach, while intuitive, has one major advantage and one major flaw. The flaw motivates the refinement into the final Transformer mechanism.

### 1. Advantage: Parallel Processing
The operations required (matrix multiplication for dot product, Softmax, and matrix multiplication for the weighted sum) can all be achieved using **Linear Algebra**. This means that the contextual embeddings for *all words* in a sentence (be it 3 or 3,000) can be calculated **simultaneously (in parallel)**.

*   **Impact:** This parallelization is the key advantage of Self-Attention, allowing models to fully utilize GPU power and enabling **very fast training**.
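As a rough illustration of this point, the sketch below computes the contextual embeddings once word by word and once as a single pair of matrix products, then checks that both give the same result. The sentence length, the embedding size, and the random values are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))  # 5 words, 8-dim embeddings (arbitrary sizes)

# Word-by-word version: one contextual embedding at a time.
Y_loop = np.zeros_like(E)
for i in range(E.shape[0]):
    scores = np.array([E[i] @ E[j] for j in range(E.shape[0])])
    weights = softmax(scores)
    Y_loop[i] = weights @ E

# Matrix version: all words handled by two matrix multiplications and one softmax.
Y_matrix = softmax(E @ E.T, axis=-1) @ E

print(np.allclose(Y_loop, Y_matrix))
```

The matrix form contains no loop over positions, which is what lets a GPU process every word of the sentence at once.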

### 2. Major Flaw: No Task-Specific Learning
The crucial drawback of the simple model is that it contains **no learnable parameters** (weights and biases). It relies solely on dot products and Softmax, which are fixed mathematical operations.

*   **General Embeddings:** Because there is no learning, the model only produces **General Contextual Embeddings**, which are independent of the specific NLP task being performed (e.g., translation, sentiment analysis).
*   **The Problem of Idioms:** General embeddings fail when the translation or interpretation depends on nuances specific to the dataset (e.g., an idiom like "piece of cake," which should be translated as "very easy job," but a general model might translate it literally as "a slice of cake").
*   **Conclusion:** The model needs the ability to generate **Task-Specific Contextual Embeddings** by introducing **learnable parameters** that can adjust based on the training data.

## IV. The Refined Self-Attention: Query, Key, and Value (QKV)

To introduce learnable parameters, the simple process must be augmented.

### 1. The Three Roles of an Embedding
The critical insight is that within the simple process, every word's static embedding (e.g., $E_{bank}$) plays three distinct roles:

1.  **Query (Q):** The embedding acts as a question, asking other words for their similarity score ("How similar are you to me?").
2.  **Key (K):** The embedding acts as a reference point that replies to the Query, offering itself for similarity comparison.
3.  **Value (V):** The embedding acts as the content whose information is used in the final **weighted sum** to create the output.

### 2. Separation of Concerns and QKV Generation

It is not logical for a single vector to perform all three roles simultaneously. Instead, the static word embedding should be **transformed** into three separate, specialized vectors: Query (Q), Key (K), and Value (V).

This transformation is achieved using **Linear Transformation** (vector-matrix multiplication):

*   A vector is multiplied by a matrix ($W$) to transform it into a new vector, changing its magnitude and direction (and, depending on the shape of $W$, its dimension).
*   Three distinct **Transformation Matrices** are introduced for each role: $W_Q, W_K, W_V$.

| Vector Generated | Calculation | Role/Purpose |
| :--- | :--- | :--- |
| **Query (Q)** | $E \cdot W_Q$ | Used to ask for similarity (the search query). |
| **Key (K)** | $E \cdot W_K$ | Used as a reference point to respond to queries (the profile). |
| **Value (V)** | $E \cdot W_V$ | Used in the final weighted sum (the content shared after a match). |

<img src="./images/sa3.png">
<img src="./images/sa4.png">

### 3. Introducing Learnable Parameters
These three transformation matrices ($W_Q, W_K, W_V$) contain the crucial **learnable parameters**.

*   **Training:** They are initialized with random values and become part of the training process. The **Backpropagation** algorithm updates the values within $W_Q, W_K,$ and $W_V$ based on the task loss.
*   **Task Specificity:** By learning optimal values from the data, the model gains the ability to generate Q, K, and V vectors that are **specific to the task** (e.g., machine translation), thereby solving the problem of General Contextual Embeddings.
*   **Weight Sharing:** It is critical to note that the same set of three matrices ($W_Q, W_K, W_V$) is used to generate the Q, K, and V vectors for **every single word** in the sentence.
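A minimal sketch of QKV generation with shared matrices follows. The sizes (`d_model`, `d_k`, `d_v`) and the random initial values are illustrative assumptions. In a real model the three matrices would be updated by backpropagation rather than left random.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, d_k, d_v = 3, 6, 4, 4  # illustrative sizes only

E = rng.normal(size=(n_words, d_model))  # static embeddings, one row per word

# Randomly initialised projection matrices (these hold the learnable parameters).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

# One matrix multiplication per role projects every word at once.
Q = E @ W_Q
K = E @ W_K
V = E @ W_V

# Weight sharing: projecting a single word with the same W_Q gives the same row of Q.
q_second_word = E[1] @ W_Q
print(np.allclose(q_second_word, Q[1]))
```

Since the same three matrices serve every position, the parameter count does not grow with sentence length.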

## V. Final Self-Attention Flow

The final, refined Self-Attention mechanism operates in three main stages:

1.  **QKV Generation:** The static embedding of every word is transformed in parallel using $W_Q, W_K,$ and $W_V$ into its corresponding Q, K, and V vectors.
2.  **Attention Scores:** The similarity scores are calculated by taking the **dot product of Q and K** (instead of $E$ and $E$). This raw score matrix is then normalized using **Softmax** to get the final weights.
3.  **Contextual Output:** These final weights are multiplied by the **Value (V)** matrix to calculate the weighted sum, yielding the final, task-specific contextual embeddings.
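The three stages can be put together in one short function. This is a sketch of the unscaled mechanism as described in this section, with random matrices standing in for trained ones. The name `self_attention` and the sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def self_attention(E, W_Q, W_K, W_V):
    """Unscaled self-attention following the three stages listed above."""
    # Stage 1: project every word's embedding into Q, K and V.
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    # Stage 2: Q-K dot products, normalised row by row with softmax.
    weights = softmax(Q @ K.T, axis=-1)
    # Stage 3: weighted sum of the Value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
n_words, d_model, d_k = 3, 6, 4  # illustrative sizes only
E = rng.normal(size=(n_words, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

contextual, weights = self_attention(E, W_Q, W_K, W_V)
print(contextual.shape)  # one d_k-dimensional contextual vector per word
```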

The architecture built around this mechanism, which replaces the recurrence of RNNs with Self-Attention, is the **Transformer**. Understanding this mechanism is necessary to understand the current generation of LLMs and Generative AI.

## I. Recap of Self-Attention and the Initial Formula

The goal of **Self-Attention** is to convert static word embeddings (which capture only the average meaning of a word) into dynamic, **contextual embeddings** (which reflect the word's meaning in a specific sentence).

### 1. The Core Calculation
The process involves calculating three matrices for the input sentence: Query ($Q$), Key ($K$), and Value ($V$). These are generated by multiplying the original word embeddings ($E$) by three distinct, **learnable transformation matrices** ($W_Q, W_K, W_V$).

The standard, non-scaled formula for calculating attention, derived from first principles, is:
$$\text{Attention}(Q, K, V) = \text{Softmax}(QK^T)V$$

This operation sequence involves:

1.  **Dot Product:** Calculating $Q$ times the transpose of $K$ ($QK^T$), yielding a matrix of raw similarity scores.
2.  **Softmax:** Normalizing these raw scores.
3.  **Weighted Sum:** Multiplying the result by the Value matrix ($V$) to produce the final contextual embeddings.

## II. The Scaling Factor and $d_k$

When comparing this formula derived from first principles with the formula presented in the original Transformer paper, "Attention Is All You Need," a key difference emerges: the original paper uses a scaling factor. This mechanism is called **Scaled Dot Product Attention**.

The finalized formula used in Transformers is:
$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

### Defining $d_k$
The term $d_k$ is defined simply as the **dimension of the Key vector** (K vector).

*   If the Key vectors are 512 dimensions, $d_k$ is 512.
*   The Query and Key vectors must share the same dimension ($d_q = d_k$) because they are combined through a dot product. The Value vector's dimension ($d_v$) is set by the shape of $W_V$ and can differ, although in simplified cases all three are equal.
*   The scaling operation means that **every value** in the intermediate $QK^T$ matrix is divided by the square root of $d_k$ (e.g., by $\sqrt{3}$ when $d_k = 3$).
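The scaled formula can be sketched as below. The function name and the random inputs are assumptions for illustration. `d_v` is deliberately set to a different size than `d_k` to show that only the Query and Key dimensions need to match.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d_k)) V for a single sentence."""
    d_k = K.shape[-1]                # dimension of each Key vector
    scores = Q @ K.T / np.sqrt(d_k)  # every raw score is divided by sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 words, d_k = 4
K = rng.normal(size=(3, 4))  # must share d_k with Q
V = rng.normal(size=(3, 5))  # d_v = 5, chosen differently on purpose

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # (3, 5): one d_v-dimensional vector per word
```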

## III. The Problem: Unstable Gradients and High Variance

The reason for this extra effort of scaling is to ensure **stable training** and prevent **vanishing gradients**. The root cause lies in the **nature of the dot product**.

### 1. High-Dimensional Vectors Yield High Variance
When calculating $QK^T$, multiple dot products between the individual Query vectors and Key vectors occur, resulting in a matrix of scalar scores.

*   **Low Dimension:** When the vectors are low-dimensional (e.g., 3-dimensional), the resulting dot product scores exhibit **low variance**.
*   **High Dimension:** As the vector dimension increases (e.g., 100 or 1,000 dimensions), the resulting dot product scores exhibit **high variance**.
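*   **Linear Relationship:** There is a **linear relationship** between the dimension ($D$) and the variance: $\text{Variance} \propto D$. This holds when the vector components are independent with zero mean and equal variance, because the dot product is then a sum of $D$ independent terms whose variances add. For example, a 3-dimensional dot product yields a variance roughly three times that of a 1-dimensional dot product.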
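The relationship between dimension and variance can be checked empirically with a small simulation. It assumes that query and key components are independent standard normal values, which is an idealisation rather than a property of trained vectors. The helper name `dot_product_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(dim, n_samples=10_000):
    """Empirical variance of q.k for random q, k with standard normal components."""
    q = rng.standard_normal((n_samples, dim))
    k = rng.standard_normal((n_samples, dim))
    scores = np.sum(q * k, axis=1)  # one dot product per sampled pair
    return scores.var()

variances = {dim: dot_product_variance(dim) for dim in (1, 3, 64, 512)}
for dim, var in variances.items():
    print(f"dim={dim:4d}  variance={var:8.2f}  variance/dim={var / dim:.2f}")
```

Under these assumptions the ratio `variance/dim` stays close to 1 for every dimension, which is the linear growth described above.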

### 2. The Softmax Effect (Extremes)
The problem occurs because these high-variance dot product scores are immediately fed into the **Softmax function**.

*   Softmax uses an exponential function to convert input scores into probabilities that sum to one.
*   When Softmax receives inputs with high variance (large gap between small and large numbers), it aggressively pushes the largest numbers toward an extreme probability (close to 100%) and the smallest numbers toward a near-zero probability.
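A small numeric example illustrates this behaviour. The score values are arbitrary, and the factor of 10 simply mimics a wider spread of scores.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

scores = np.array([1.0, 2.0, 3.0])

low_spread = softmax(scores)        # roughly [0.09, 0.24, 0.67]
high_spread = softmax(10 * scores)  # same ordering, but gaps ten times larger

print(low_spread)
print(high_spread)  # the largest score takes almost all of the probability mass
```

The ordering of the scores is unchanged. Only their spread differs, yet the second output is nearly one-hot.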

### 3. Training Instability
These extreme probabilities lead to training instability:

*   **Focus Shift:** The training process, guided by the gradient of the loss function, puts its entire focus on correcting the errors associated with the few very large numbers.
*   **Vanishing Gradients:** The small numbers (those pushed close to 0%) are effectively **ignored**. Because Softmax is saturated in this regime, the gradients flowing back through it are **very small**, leading to the **vanishing gradient problem** where the corresponding weights barely update, compromising the overall training process.
*   **Need for High Dimension:** While one simple solution might be to use low-dimensional embeddings, high-dimensional embeddings are necessary to extract sufficient **useful information** from the text. Therefore, the solution must be to control the variance of the high-dimensional dot product scores.
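The vanishing-gradient effect can be made concrete by looking at the Softmax Jacobian, which for an output vector $p$ is $\text{diag}(p) - pp^T$. The sketch below compares it for a moderate and a widely spread set of scores. The score values are arbitrary and the helper name `softmax_jacobian` is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def softmax_jacobian(scores):
    """Matrix of partial derivatives d p_i / d s_j for p = softmax(s)."""
    p = softmax(scores)
    return np.diag(p) - np.outer(p, p)

moderate = softmax_jacobian(np.array([1.0, 2.0, 3.0]))
saturated = softmax_jacobian(np.array([10.0, 20.0, 30.0]))

print(np.abs(moderate).max())   # derivatives of a useful size
print(np.abs(saturated).max())  # derivatives several orders of magnitude smaller
```

Every gradient that reaches $W_Q$ and $W_K$ passes through this Jacobian, so when its entries are tiny those matrices receive almost no update.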

## IV. Mathematical Justification for Scaling by $\frac{1}{\sqrt{d_k}}$

The primary objective is to **control the variance** of the $QK^T$ matrix, ensuring that the variance remains constant regardless of the dimension $d_k$.

### 1. Reducing Variance
The simplest way to reduce the variance of a set of numbers is to **divide them by a constant** greater than 1 (a scaling factor).

### 2. Using the Square Root of the Dimension
The precise scaling factor of $\frac{1}{\sqrt{d_k}}$ is required because of a mathematical rule regarding variance:

*   If you have a random variable $X$ with variance $\text{Var}(X)$, and you create a new variable $Y$ by scaling $X$ by a constant $C$ ($Y=CX$), the variance of $Y$ is $\text{Var}(Y) = C^2 \text{Var}(X)$.
*   Since the total variance ($\text{Var}_{\text{total}}$) increases linearly with the dimension ($d_k$), we know: $\text{Var}_{\text{total}} = d_k \times \text{Var}_{\text{1D}}$.
*   To keep the final variance constant (equal to $\text{Var}_{\text{1D}}$), we need $C^2$ to be equal to $\frac{1}{d_k}$.
    $$\text{Target Variance} = C^2 \times \text{Var}_{\text{total}} = C^2 \times d_k \times \text{Var}_{\text{1D}}$$
    $$\text{Var}_{\text{1D}} = C^2 \times d_k \times \text{Var}_{\text{1D}}$$
    $$\implies C^2 = \frac{1}{d_k}$$
    $$\implies C = \frac{1}{\sqrt{d_k}}$$

Therefore, dividing every raw similarity score in the $QK^T$ matrix by $\sqrt{d_k}$ mathematically ensures that the variance of those scores is stabilized, regardless of how large the underlying vector dimension ($d_k$) is. This stability prevents the Softmax function from creating extreme probabilities, thereby enabling **stable and effective training**.
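The effect of the scaling factor can be checked with a rough simulation. It assumes query and key components are independent standard normal values, and the sizes (`d_k = 512`, 10 keys per query, 500 trials) are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_k, n_keys, n_trials = 512, 10, 500

q = rng.standard_normal((n_trials, d_k))          # one query per trial
K = rng.standard_normal((n_trials, n_keys, d_k))  # ten keys per trial

raw = np.einsum("td,tnd->tn", q, K)  # one row of QK^T per trial
scaled = raw / np.sqrt(d_k)

raw_var, scaled_var = raw.var(), scaled.var()

# Average size of the largest attention weight in each row.
raw_peak = softmax(raw, axis=-1).max(axis=-1).mean()
scaled_peak = softmax(scaled, axis=-1).max(axis=-1).mean()

print(f"variance: raw={raw_var:.1f}, scaled={scaled_var:.2f}")
print(f"mean largest weight: raw={raw_peak:.2f}, scaled={scaled_peak:.2f}")
```

Under these assumptions the raw scores have a variance near $d_k$ and produce close to one-hot attention rows, while the scaled scores have a variance near 1 and spread the weight across several keys.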

***
*Analogy:* Think of the scaling factor $\frac{1}{\sqrt{d_k}}$ as a volume control on a microphone in a large stadium. As the dimension of the data ($d_k$) gets higher (like adding more people to the crowd), the raw volume (variance) automatically increases. If you fed that raw, loud volume into the Softmax system (which is like a recording studio engineer who only pays attention to the loudest sounds), the quieter, nuanced sounds would be missed (vanishing gradients). The scaling factor acts as an automated gain knob, lowering the volume by the precise mathematical amount needed to keep the input level stable and consistent, so that the engineer (Softmax) can still hear the quieter sounds instead of only the loudest one.