# Positional Encoding

REF: https://medium.com/@hunter-j-phillips/positional-encoding-7a93db4109e6

INPUT: Word vector (embedding of a word extracted from the corpus)
OUTPUT: Word vector with positional encoding added

- Periodicity (the pattern repeats)
- Easy to extend to long sequences
- Constrained values
    - Nearby positions get the most similar encodings, and the similarity gradually decreases as the distance between positions grows.
    - Sine and cosine outputs lie in [-1, 1].
- Gives each word a position in the context, which provides relative position information within the word sequence
- The model represents each word with an embedding vector of length `d_model`; stacked together, these form the embedding matrix
- Uses sine and cosine functions to generate a unique vector for each position in the sequence
    - Better than raw integer indices because sine and cosine outputs are bounded in [-1, 1]
    - No additional training is needed, since a unique representation is generated for each position.
- Positional encoding matrix: always produces the same output for a given position (it is added to the input embeddings later)
- This layer simply adds (+) a fixed set of values; it has no learnable parameters
- Parameters
    - `max_length` >= len(input): maximum supported sequence length (L), chosen large enough to cover future inputs
    - `n`: the paper uses 10000
    - `d_model`: dimension of the input embeddings
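
For reference, the sinusoidal encoding that these parameters feed into can be written as below. The symbols are chosen to match the `gen_pe` code in these notes: `k` is the position (0 <= k < L) and `i` indexes each (sin, cos) pair of dimensions (0 <= i < d_model/2).

$$
PE(k, 2i) = \sin\left(\frac{k}{n^{2i/d_{model}}}\right), \qquad
PE(k, 2i+1) = \cos\left(\frac{k}{n^{2i/d_{model}}}\right)
$$

For example, with `n = 10000` and `d_model = 4`, the second pair (i = 1) at position k = 1 uses the angle 1 / 10000^(2/4) = 0.01, so its sine is about 0.01 and its cosine is about 0.99995.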

<img src="./images/positional_encoding_s1.png">

#### Steps
- Get the input corpus
- Extract batches using a window size
- Assign each token its positional index (0, 1, 2, ...)
- Create the positional matrix from the equation
- Apply word embedding to the input
- Add the word embeddings to the positional matrix
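
The steps above can be sketched end to end as follows. This is a minimal illustration, not the reference implementation: the toy corpus, the vocabulary, the window size, the random `embedding_table` (standing in for a trained embedding layer), and the helper name `sinusoidal_pe` are all assumptions made for the example.

```python
import numpy as np

def sinusoidal_pe(max_length, d_model, n=10000):
    """Vectorized sinusoidal positional encoding, shape (max_length, d_model)."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    k = np.arange(max_length)[:, None]        # positions, shape (L, 1)
    i = np.arange(d_model // 2)[None, :]      # pair index, shape (1, d_model/2)
    theta = k / (n ** (2 * i / d_model))      # angle for every (position, pair)
    pe = np.zeros((max_length, d_model))
    pe[:, 0::2] = np.sin(theta)               # even dims: sin
    pe[:, 1::2] = np.cos(theta)               # odd dims: cos
    return pe

# Step 1: get the input corpus (hypothetical toy data)
corpus = "the cat sat on the mat".split()
vocab = {word: idx for idx, word in enumerate(sorted(set(corpus)))}

# Step 2: extract one batch with a window size (assumed to be 4 here)
window, d_model = 4, 8
batch = corpus[:window]
token_ids = np.array([vocab[w] for w in batch])

# Step 3: positional index of each token in the window
positions = np.arange(len(batch))

# Step 4: positional matrix from the equation (max_length >= len(input))
pe = sinusoidal_pe(max_length=16, d_model=d_model)

# Step 5: word embedding (random table as a stand-in for a trained one)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))
embedded = embedding_table[token_ids]

# Step 6: add the positional rows to the word embeddings
encoded = embedded + pe[positions]
print(encoded.shape)  # (4, 8)
```

Because the positional matrix is fixed, it can be computed once for `max_length` and sliced for any shorter input.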



##### Step 1:


Q: Why use both sine and cosine to make each encoding unique?

A: To further reduce the chance of different positions having the same encoding.
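
One way to see why the pair helps: sine alone cannot tell an angle θ from π − θ, but those two angles have different cosines, so the (sin, cos) pair identifies the angle within one period. A small sketch with arbitrarily chosen angles (the values are illustrative, not from the source):

```python
import math

theta_a = 0.5
theta_b = math.pi - 0.5   # a different angle with the same sine

# The sine values collide, the cosine values do not
same_sin = math.isclose(math.sin(theta_a), math.sin(theta_b))
same_cos = math.isclose(math.cos(theta_a), math.cos(theta_b))
print(same_sin, same_cos)  # True False
```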

<img src="./images/positional_encoding_s2.png">

- The sine and cosine functions have values in [-1, 1], which keeps the values of the positional encoding matrix in a normalized range.
- As the sinusoid for each position is different, you have a unique way of encoding each position.
- You have a way of measuring or quantifying the similarity between different positions, hence enabling you to encode the relative positions of words.
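
The similarity point can be checked numerically. The sketch below assumes the dot product as the similarity measure (the notes do not name one) and uses an illustrative helper `sinusoidal_pe`. Since sin(a)·sin(b) + cos(a)·cos(b) = cos(a − b), the dot product of two encodings depends only on the offset between the positions, not on where they sit in the sequence.

```python
import numpy as np

def sinusoidal_pe(max_length, d_model, n=10000):
    """Sinusoidal positional encoding; assumes an even d_model."""
    k = np.arange(max_length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    theta = k / (n ** (2 * i / d_model))
    pe = np.zeros((max_length, d_model))
    pe[:, 0::2] = np.sin(theta)
    pe[:, 1::2] = np.cos(theta)
    return pe

pe = sinusoidal_pe(50, 16)
sim = pe @ pe.T   # dot-product similarity between every pair of positions

# Same offset (3) from different starting positions gives the same similarity
same_offset = bool(np.isclose(sim[2, 5], sim[10, 13]))

# For small offsets, similarity to position 0 drops as the offset grows
decreasing = bool(sim[0, 0] > sim[0, 1] > sim[0, 2])

print(same_offset, decreasing)
```

The decrease is not strictly monotonic over long distances, because the highest-frequency dimensions oscillate; the trend holds for nearby positions.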

NOTE: Learnable positional parameters can be used instead of the fixed encoding, but no paper has yet been found that confirms better performance.

```python
import math
import numpy as np

def gen_pe(max_length, d_model, n):
    """Return the (max_length, d_model) sinusoidal positional encoding matrix."""

    # start from an all-zero matrix for the positional encodings (pe)
    pe = np.zeros((max_length, d_model))

    # for each position k
    for k in range(max_length):

        # for each (sin, cos) pair of dimensions
        for i in range(d_model // 2):

            # angle shared by the sin and cos of this pair
            theta = k / (n ** ((2 * i) / d_model))

            # even dims: sin
            pe[k, 2 * i] = math.sin(theta)

            # odd dims: cos
            pe[k, 2 * i + 1] = math.cos(theta)

    return pe

# sin(i) for positions 0..5 (matches column 0 of the matrix below)
for i in range(6):
    s_result = math.sin(i)
    print(f'{i}: {s_result}')

# cos(i) for positions 0..5 (matches column 1 of the matrix below)
for i in range(6):
    c_result = math.cos(i)
    print(f'{i}: {c_result}')

# These vectors are added to the embedded input vectors
print(gen_pe(6, 4, 10000))
```

```
0: 0.0
1: 0.8414709848078965
2: 0.9092974268256817
3: 0.1411200080598672
4: -0.7568024953079282
5: -0.9589242746631385
0: 1.0
1: 0.5403023058681398
2: -0.4161468365471424
3: -0.9899924966004454
4: -0.6536436208636119
5: 0.28366218546322625
[[ 0.          1.          0.          1.        ]
 [ 0.84147098  0.54030231  0.00999983  0.99995   ]
 [ 0.90929743 -0.41614684  0.01999867  0.99980001]
 [ 0.14112001 -0.9899925   0.0299955   0.99955003]
 [-0.7568025  -0.65364362  0.03998933  0.99920011]
 [-0.95892427  0.28366219  0.04997917  0.99875026]]
```
