## I. The Necessity of Positional Encoding (The Problem)

The sources establish that the **Self-Attention** module, which forms the core of the Transformer, possesses two major advantages over previous architectures like RNNs:

1.  **Contextual Embeddings:** Self-Attention can generate dynamic, contextual embeddings that change based on how a word is used in a sentence (e.g., distinguishing "river bank" from "money bank").
2.  **Parallel Computation:** All contextual embeddings can be calculated **simultaneously (in parallel)**, which is crucial for speed and processing large documents (e.g., a 10,000-word PDF).

### The Fatal Flaw: Ignoring Word Order
The primary drawback of the parallel calculation is that the **Self-Attention module cannot capture the order or sequence of the words**. Since all words are processed at the same time, the module has no inherent mechanism to understand whether "River" came before "Bank".

*   **Example of Ambiguity:** Without positional information, the Self-Attention module treats "Nitish killed Lion" and "Lion killed Nitish" as **the same input**: each word receives the same contextual embedding in both sentences, even though the sentences have completely different meanings.
*   **The Challenge:** Word order is critical in NLP applications. Therefore, the challenge is to find a way to **encode the order information** and pass it to the Self-Attention module so that the Transformer architecture can perform well.
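The order-blindness described above can be checked with a toy example. This is only a sketch: the embeddings are made-up 2-dimensional vectors, and `toy_self_attention` is a simplified stand-in (it sets Q = K = V and has no learned weights), not the full Transformer module.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_self_attention(X):
    """Simplified self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    outputs = []
    for q in X:
        # Scaled dot-product score of this word against every word.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Weighted average of all word vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return outputs

# Hypothetical 2-dimensional embeddings, chosen only for illustration.
emb = {"Nitish": [1.0, 0.0], "killed": [0.0, 1.0], "Lion": [1.0, 1.0]}

s1 = ["Nitish", "killed", "Lion"]
s2 = ["Lion", "killed", "Nitish"]
out1 = dict(zip(s1, toy_self_attention([emb[w] for w in s1])))
out2 = dict(zip(s2, toy_self_attention([emb[w] for w in s2])))

# Each word gets the same output vector regardless of the sentence order.
order_invariant = all(
    math.isclose(a, b, abs_tol=1e-12)
    for w in s1
    for a, b in zip(out1[w], out2[w])
)
print(order_invariant)
```

Reordering the words only reorders the output rows; the vector attached to each word does not change, which is why order has to be injected separately.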

## II. Initial, Flawed Solutions (First Principles Approach)

The lecture notes take a **first-principles approach** to teaching Positional Encoding: they explore simple solutions first and identify the flaws in each.

### A. Solution 1: Simple Counting
The simplest solution is to count the position of each word (1, 2, 3, 4, and so on) and **append this number as a new dimension** to the word's embedding vector (e.g., appending '1' to the embedding of the first word, '2' to the second, making the vector 513-dimensional if the original was 512-dimensional).

#### Problems with Simple Counting:
1.  **Unbounded Values:** This approach is **unbounded**; the position number grows linearly with the document length. If a document has 100,000 words, the final number is 100,000, which is huge.
    *   **Neural Network Conflict:** Neural networks (which rely on **Backpropagation**) dislike large numbers because they lead to **unstable gradients**, causing issues like vanishing or exploding gradients. The preferred range for numbers in NNs is generally between -1 and 1.
2.  **Normalization Fails:** Attempts to solve the unbounded problem by **normalizing** the positions (dividing by total word count) also fail. Normalization leads to **inconsistent values** for the same position across different sentence lengths (e.g., the second position is normalized to 1.0 in a two-word sentence but to 0.5 in a four-word sentence), which confuses the neural network.
3.  **Discrete Numbers:** The positional numbers (1, 2, 3, 4, etc.) are **discrete**. Neural networks generally prefer **continuous** numbers and smooth transitions, as discrete values can cause issues with numerical stability and gradient flow.
4.  **No Relative Position:** Simple counting only captures **Absolute Position** (the unique position of a word). It **cannot capture Relative Position**, that is, the distance between two words. The model cannot easily calculate the distance (e.g., 3 - 1 = 2) because the input comes from a discrete function.
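The first two problems can be seen with a few lines of arithmetic. This is an illustrative sketch; the helper name `normalized_positions` and the sentence lengths are assumptions made for the example.

```python
def normalized_positions(num_words):
    """Position divided by total word count, with positions starting at 1."""
    return [pos / num_words for pos in range(1, num_words + 1)]

# Raw counting is unbounded: the last word of a 100,000-word document
# is encoded as 100000.
last_raw_position = 100_000

# Normalizing bounds the values, but the same position now maps to
# different numbers depending on sentence length.
short = normalized_positions(2)   # two-word sentence
longer = normalized_positions(4)  # four-word sentence

print(short)   # [0.5, 1.0]
print(longer)  # [0.25, 0.5, 0.75, 1.0]
print(short[1], longer[1])  # second word: 1.0 vs 0.5
```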

## III. The Breakthrough: Using Trigonometric Functions

The analysis of the flaws in simple counting led to the realization that an ideal function must be **bounded, continuous, and periodic**. This pointed directly to **trigonometric functions**, specifically the sine function.

### A. Solution 2: Single Sine Function
The second proposed solution involves using the sine function: $Y = \sin(\text{Position})$, where the word's position (e.g., 1, 2, 3) is the input to the sine function, and the output (Y, the encoded value) is appended to the embedding.

#### Advantages of the Sine Function:
*   **Bounded:** The output value ($Y$) always stays between **-1 and 1**, solving the unbounded problem.
*   **Continuous:** The sine curve is a continuous function, addressing the discrete number problem.
*   **Relative Position (Theoretical):** Theoretically, a periodic function like the sine curve **can capture relative position**, though the lecturer reserves the mathematical proof for later.

#### The Critical Flaw: Non-Unique Values (Periodicity)
The major flaw of a single sine function is its **periodicity**.

*   Since the curve repeats its pattern, two words at different positions can produce **the same, or nearly the same, positional encoding value**. For integer positions, for example, $\sin(1) \approx 0.8415$ and $\sin(21) \approx 0.8367$.
*   If two words have the same PE value but different positions, the model will be confused, thinking they are at the same position, which leads to major errors.
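A quick way to see the repetition problem is to search for positions whose sine values almost coincide. The sketch below is illustrative only: the threshold `tolerance` and the range of 50 positions are arbitrary assumptions.

```python
import math

def sine_encoding(pos):
    """Single-sine positional value: Y = sin(position)."""
    return math.sin(pos)

tolerance = 0.01  # arbitrary threshold for "nearly the same"

# All pairs of distinct positions in 1..50 with almost identical values.
near_collisions = [
    (p, q)
    for p in range(1, 51)
    for q in range(p + 1, 51)
    if abs(sine_encoding(p) - sine_encoding(q)) < tolerance
]

print(len(near_collisions))
print(round(sine_encoding(1), 4), round(sine_encoding(21), 4))  # 0.8415 0.8367
```

Positions 1 and 21, for instance, land on nearly the same height of the curve, so a scalar sine value alone cannot tell them apart reliably.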

### B. Solution 3: The Sin-Cos Pair (Vector Representation)
The problem of repeating values is solved by switching from a single function (scalar output) to a **pair of trigonometric functions (vector output)**.

*   **Mechanism:** For every word, two values are calculated: one from the **sine curve** and one from the **cosine curve**.
*   **Vector Representation:** The Positional Encoding is now represented as a **vector** (e.g., approximately $[0.84, 0.54]$ for position 1) rather than a single scalar value.
*   **Advantage:** The probability of two different words having the **exact same two numbers** (the same vector) is significantly lower, reducing the chance of value repetition.
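The effect of adding the cosine value can be sketched as follows. The positions 1, 21, and 45 are chosen for illustration because their sine values are close; the helper names are assumptions, not part of the lecture.

```python
import math

def sin_cos_encoding(pos):
    """Two-value positional vector: (sin(position), cos(position))."""
    return (math.sin(pos), math.cos(pos))

def distance(pos_a, pos_b):
    """Euclidean distance between the encodings of two positions."""
    return math.dist(sin_cos_encoding(pos_a), sin_cos_encoding(pos_b))

# Positions 1 and 21 have nearly equal sine values, but their cosine
# values have opposite signs, so the pair separates them clearly.
d_1_21 = distance(1, 21)

# Positions 1 and 45 differ by almost exactly 7 full periods (14 * pi),
# so even the pair leaves them close together.
d_1_45 = distance(1, 45)

print(d_1_21 > 1.0, d_1_45 < 0.05)  # True True
```

The pair reduces collisions rather than eliminating them: positions separated by nearly a whole number of periods still end up close together.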

### C. Solution 4: Multiple Sin-Cos Pairs (Varying Frequency)
To handle very long documents where even a Sin-Cos pair might eventually repeat, the solution is to **increase the dimensionality** of the Positional Encoding vector by adding **more pairs of Sin-Cos functions**.

*   **Pair Extension:** Subsequent pairs are introduced by dividing the position by an increasing factor: $\sin(\text{Position}/2)$ and $\cos(\text{Position}/2)$, then $\sin(\text{Position}/3)$ and $\cos(\text{Position}/3)$, and so on.
*   **Frequency Modulation:** As more pairs are added, the **frequency** of the curves is continually **reduced**. This ensures that even for very large documents, the final, high-dimensional PE vector is extremely unlikely to repeat.
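The benefit of extra, slower pairs can be sketched with the simple divisors from the notes ($\text{Position}/1$, $\text{Position}/2$, $\text{Position}/3$, ...). The function name `multi_pair_encoding` and the choice of four pairs are assumptions for illustration.

```python
import math

def multi_pair_encoding(pos, num_pairs):
    """Concatenate sin/cos pairs of pos/1, pos/2, ..., pos/num_pairs."""
    vec = []
    for k in range(1, num_pairs + 1):
        vec.append(math.sin(pos / k))
        vec.append(math.cos(pos / k))
    return vec

# With one pair, positions 1 and 45 are almost indistinguishable.
d_single = math.dist(multi_pair_encoding(1, 1), multi_pair_encoding(45, 1))

# With four pairs, the lower-frequency pairs pull them far apart.
d_multi = math.dist(multi_pair_encoding(1, 4), multi_pair_encoding(45, 4))

print(d_single < 0.05, d_multi > 1.0)  # True True
```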

## IV. The Final Implementation in the Transformer

The final solution adopted in the "Attention Is All You Need" paper is the advanced version of the multiple Sin-Cos pair approach.

### A. PE Vector Properties
1.  **PE is a Vector:** The Positional Encoding for every word is a **vector** (a set of numbers).
2.  **Dimension Matching:** The dimension of the PE vector must **exactly match** the dimension of the word's Embedding Vector ($D_{\text{model}}$). If the embedding is 512-dimensional, the PE must also be 512-dimensional.
3.  **Combination Method:** The PE vector is **added** to the Embedding Vector, not concatenated.
    *   **Rationale for Addition:** Concatenation would double the input dimension (e.g., 512 to 1024), which would substantially increase the **number of parameters** in the layers that consume it and therefore the **training time**. Addition maintains the dimension, keeping the training fast.

### B. Calculation Using the Official Formula
The values inside the PE vector are calculated using the official formula from the paper:

$$\text{PE}(\text{pos}, 2i) = \sin\left(\text{pos} / 10000^{2i/D_{\text{model}}}\right), \qquad \text{PE}(\text{pos}, 2i+1) = \cos\left(\text{pos} / 10000^{2i/D_{\text{model}}}\right)$$

*   **$D_{\text{model}}$:** Dimension of the embedding (e.g., 512).
*   **pos (Position):** The position of the word (starts at 0 for the first word).
*   **$i$ (Pair Index):** Ranges from 0 up to $D_{\text{model}}/2 - 1$. Dimension $2i$ of the PE vector holds the sine value and dimension $2i+1$ holds the cosine value.
*   **Paired Calculation:** Every two consecutive dimensions (an even one and the odd one after it) form a **Sin-Cos pair** that shares the same value of $i$, and therefore the same frequency.
*   **Frequency Control:** The denominator term, $10000^{2i/D_{\text{model}}}$, mathematically controls the **frequency** of the sine/cosine waves, ensuring the frequency decreases for every successive pair.
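A minimal sketch of the sinusoidal encoding as defined in "Attention Is All You Need", where dimensions $2k$ and $2k+1$ share the exponent $2k/D_{\text{model}}$. The function names and the tiny `d_model = 4` used in the example are assumptions for illustration; real implementations typically vectorize this with a tensor library.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal PE vector for one position (d_model assumed even)."""
    pe = []
    for dim in range(d_model):
        pair = dim // 2  # dimensions 2k and 2k+1 share one frequency
        angle = pos / (10000 ** (2 * pair / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        pe.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return pe

def add_positional_encoding(embeddings):
    """Add (not concatenate) the PE vector to each word embedding."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, positional_encoding(pos, d_model))]
        for pos, emb in enumerate(embeddings)
    ]

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]

# Hypothetical 4-dimensional embeddings for a three-word sentence.
sentence = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
encoded = add_positional_encoding(sentence)
print(len(encoded), len(encoded[0]))  # 3 4
```

Because the vectors are added element-wise, the output keeps the embedding dimension unchanged.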

<img src="./images/pe1.png">

*   **Visualization:** Plotting these PE vectors shows a pattern highly similar to **Binary Encoding**, but implemented in the **continuous value domain** (using sine/cosine) instead of discrete values, which is exactly what neural networks prefer.

## V. The Capture of Relative Position (The Mind-Blowing Property)

The lecture notes save the discussion of **Relative Position** for last, as it is the most sophisticated property.

### A. The Hidden Property
The way the Sin-Cos PE vectors are constructed imparts a "mind-blowing" property: for any given **delta (distance)** between two positions (e.g., $\Delta = 10$), there exists a **single, specific Linear Transformation (Matrix)** that can convert the positional vector of the starting position ($V_{10}$) into the positional vector of the end position ($V_{20}$).

*   **Example:** A specific matrix $M_{\Delta=10}$ can transform $V_{10} \to V_{20}$. Crucially, this *same* matrix $M_{\Delta=10}$ can also transform $V_{30} \to V_{40}$.
*   **Generalization:** For every possible distance ($\Delta=1, \Delta=2, \dots$), there is a corresponding Linear Transformation matrix available in the system.

### B. Implication
This means that the offset between any two positions is recoverable through a simple linear operation, allowing the model to capture **Relative Position**, a feature that was impossible to encode with simple discrete numbers.

### C. Mathematical Rationale
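The reason this linear relationship exists is precisely **because of the use of the Sin-Cos pair**. By the angle-addition identities, the pair $(\sin(\omega(\text{pos}+\Delta)), \cos(\omega(\text{pos}+\Delta)))$ is a rotation of $(\sin(\omega\,\text{pos}), \cos(\omega\,\text{pos}))$ by the angle $\omega\Delta$, where $\omega$ is the frequency of that pair. The rotation depends only on $\Delta$, not on $\text{pos}$, which is why one matrix per distance is enough.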
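The property can be checked numerically for a single Sin-Cos pair. This is a sketch under stated assumptions: `omega` is an arbitrary illustrative frequency (in the full encoding each pair has its own), and the helper names are hypothetical. For the complete PE vector, the transformation would be a block-diagonal matrix with one such 2x2 block per pair.

```python
import math

def pe_pair(pos, omega):
    """One Sin-Cos pair of the encoding at a given frequency."""
    return (math.sin(omega * pos), math.cos(omega * pos))

def shift_matrix(delta, omega):
    """2x2 matrix that maps pe_pair(pos) to pe_pair(pos + delta)."""
    c, s = math.cos(omega * delta), math.sin(omega * delta)
    return ((c, s), (-s, c))

def apply(matrix, vec):
    """Matrix-vector product for small tuples."""
    return tuple(sum(m * v for m, v in zip(row, vec)) for row in matrix)

omega = 0.1  # illustrative frequency, not taken from the paper
M = shift_matrix(10, omega)  # depends only on delta = 10

# The same matrix moves position 10 to 20 and position 30 to 40.
v20 = apply(M, pe_pair(10, omega))
v40 = apply(M, pe_pair(30, omega))

ok_20 = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(v20, pe_pair(20, omega)))
ok_40 = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(v40, pe_pair(40, omega)))
print(ok_20, ok_40)  # True True
```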

This property is what transforms Positional Encoding from a simple counting mechanism into a powerful, fixed (not learned) representation of sequence order within the Transformer architecture.

<a href="https://blog.timodenk.com/linear-relationships-in-the-transformers-positional-encoding/">Blog link</a>