We will implement relative positional encoding with the modifications from the T5 paper.

The basic idea of relative pos encoding is using a matrix like this:

tensor([[ 0, 1, 2, 3],
        [-1, 0, 1, 2],
        [-2,-1, 0, 1],
        [-3,-2,-1, 0]])

Each position is thus encoded relative to the "current token" (the query): rows are queries, columns are keys.

Note that for decoder self-attention we want causal masking (meaning: we don't want to give ANY info about future tokens),
so the matrix will look like this:

tensor([[ 0, 0, 0, 0],
        [-1, 0, 0, 0],
        [-2,-1, 0, 0],
        [-3,-2,-1, 0]])

In this implementation we also flip the sign so that distances into the past are positive (the sign convention does not matter as long as it is consistent), which gives:

tensor([[ 0, 0, 0, 0],
        [ 1, 0, 0, 0],
        [ 2, 1, 0, 0],
        [ 3, 2, 1, 0]])

We'll then apply "buckets", where ranges of distances share the same positional encoding. The first 3 "past" tokens keep their exact position and more distant ones fall into progressively wider buckets, so for a larger seq_len the last row will look something like this:

[7, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 3, 2, 1, 0]
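
As a rough sketch of the bucketing just described (not necessarily identical to the step-by-step code further down), the mapping from distance to bucket can be packaged as one function. The name causal_bucket and the defaults num_buckets=8 and max_distance=128 are assumptions made for illustration; the two clamp calls are additions that guard against log(0) and against distances beyond max_distance.

```python
import math
import torch

def causal_bucket(rel_pos, num_buckets=8, max_distance=128):
    """Map non-negative 'distance into the past' values to bucket indices.

    Hypothetical helper for illustration; defaults are assumptions.
    """
    num_exact = num_buckets // 2          # distances below this keep their exact value
    is_exact = rel_pos < num_exact

    # Clamp before the log so distance 0 does not produce -inf; those
    # entries are replaced by the exact branch in torch.where anyway.
    safe = rel_pos.clamp(min=1).float()
    val_if_large = num_exact + (
        torch.log(safe / num_exact)
        / math.log(max_distance / num_exact)
        * (num_buckets - num_exact)
    ).long()

    # Distances at or beyond max_distance all share the last bucket
    val_if_large = val_if_large.clamp(max=num_buckets - 1)
    return torch.where(is_exact, rel_pos, val_if_large)

# Last row of a 16-token causal matrix: distances 15, 14, ..., 0
last_row = causal_bucket(torch.arange(15, -1, -1))
```

Where exactly the bucket boundaries fall depends on num_buckets and max_distance, so the row printed by this sketch need not match the illustrative row above.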


Some general info about nn.Embedding:

Both nn.Linear and nn.Embedding will give you, in your example, a 3-dim vector. That’s the whole point, i.e., to convert a token
into an ideally meaningful vector (i.e., a numeric, fixed-size representation of a word). The difference is in the input:
nn.Linear expects a one-hot vector of the size of the vocabulary, with the single 1 at the index representing the specific word.
nn.Embedding just expects this index (and not a whole vector).

However, if nn.Linear (without a bias) and nn.Embedding were initialized with the same weights, their outputs would be exactly the same.
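
A quick way to check this claim, assuming the nn.Linear has no bias and its weight is the transpose of the embedding table (nn.Linear stores its weight as (out_features, in_features)). The sizes and token ids are arbitrary:

```python
import torch
from torch import nn
import torch.nn.functional as F

vocab_size, emb_dim = 10, 3   # illustrative sizes

embedding = nn.Embedding(vocab_size, emb_dim)
linear = nn.Linear(vocab_size, emb_dim, bias=False)

# Give both layers the same weights (transposed for nn.Linear)
with torch.no_grad():
    linear.weight.copy_(embedding.weight.T)

token_ids = torch.tensor([1, 4, 7])
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()

from_embedding = embedding(token_ids)   # index lookup
from_linear = linear(one_hot)           # one-hot matrix multiply
same = torch.allclose(from_embedding, from_linear)
print(same)
```

The lookup avoids building the one-hot matrix at all, which is why nn.Embedding is the usual choice for large vocabularies.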

Yes, by default, the weights of both layers will be modified during the training process. In this respect, they are like any other
layers in your network. However, you can tell the network not to modify the weights of any specific layer; I think it would look
something like this:

embedding = nn.Embedding(10, 3)
embedding.weight.requires_grad = False

This makes sense if you use pretrained word embeddings such as Word2Vec or GloVe. If you initialize your weights randomly, you
certainly want them to be modified during training.
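
If the pretrained vectors are already in a tensor, nn.Embedding.from_pretrained does the loading and freezing in one step. A minimal sketch, with a random tensor standing in for real Word2Vec/GloVe vectors:

```python
import torch
from torch import nn

# Stand-in for pretrained vectors (random here, purely for illustration)
pretrained = torch.randn(10, 3)

frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
trainable = nn.Embedding(10, 3)

# Hand only the parameters that still require gradients to the optimizer
params = [p for m in (frozen, trainable) for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.1)

print(frozen.weight.requires_grad, trainable.weight.requires_grad)
```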

import numpy as np
import torch
from torch import nn, einsum
import torch.nn.functional as F
import math

num_buckets = 6
max_distance = 20    # Largest distance we bucket for - this will be 128 as per the paper
seq_len = 15         # This is query length
max_context_len = 15 # This is key length - normally the same as the query length, but not for Transformer-XL, where cached keys are concatenated as part of the recurrence

# Now we construct a matrix as per the above

q_pos = torch.arange(seq_len, dtype=torch.long)               # Query positions (become the column below)
k_pos = torch.arange(max_context_len, dtype=torch.long)       # Key positions (the top row)

# Trick:
#[0, 1, 2, 3] - [[0], == (via broadcasting) [[0, 1, 2, 3] - [[0, 0, 0, 0], == [[ 0, 1, 2, 3], 
#                [1],                        [0, 1, 2, 3]    [1, 1, 1, 1],     [-1, 0, 1, 2],
#                [2],                        [0, 1, 2, 3]    [2, 2, 2, 2],     [-2,-1, 0, 1],
#                [3]]                        [0, 1, 2, 3]]   [3, 3, 3, 3]]     [-3,-2,-1, 0]]

# So we need to convert q_pos to a column vector:
q_pos = q_pos.reshape(q_pos.shape[0], 1)

rel_pos = k_pos - q_pos
#rel_pos # Example with seq_len 10 for the query and max_context_len 15 for the (concatenated) keys, instead of the 15 and 15 set above:
# Queries run "up/down" since we only have the current sequence; keys run across, as a concat of keys for recurrence ->>
#tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14],
#        [-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13],
#        [-2, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12],
#        [-3, -2, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
#        [-4, -3, -2, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
#        [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
#        [-6, -5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8],
#        [-7, -6, -5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5,  6,  7],
#        [-8, -7, -6, -5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5,  6],
#        [-9, -8, -7, -6, -5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])

# Next: since we're building a decoder, we "mask" the future by setting it to 0, i.e., we don't encode anything for the future
# We also flip the sign so past positions are positive, just for convenience - it doesn't really matter since it's all relative and consistent

rel_pos = -rel_pos


rel_pos = torch.max(rel_pos, torch.zeros_like(rel_pos))

#rel_pos # For 10x10
#tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
#        [3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
#        [4, 3, 2, 1, 0, 0, 0, 0, 0, 0],
#        [5, 4, 3, 2, 1, 0, 0, 0, 0, 0],
#        [6, 5, 4, 3, 2, 1, 0, 0, 0, 0],
#        [7, 6, 5, 4, 3, 2, 1, 0, 0, 0],
#        [8, 7, 6, 5, 4, 3, 2, 1, 0, 0],
#        [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]])

#rel_pos # For seq_len/query 10 and max_context_len/keys 15:
#tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [4, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [5, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [6, 5, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [7, 6, 5, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
#        [8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
#        [9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0]])                 

# Now for the T5 modifications == the buckets

# The first half of the buckets hold exact distances, so "buckets" with just one distance in them

num_token_buckets = num_buckets // 2    # This is 3 if num_buckets is 6, so 0, 1, 2 are exact and in single-item buckets

# We're making the changes by applying masks on the matrix elements

# First a mask that puts True on items that don't need to change (distances 0, 1, 2), and False on all the others

is_exact = rel_pos < num_token_buckets

#is_exact
# The last row (10x10 example) is [False, False, False, False, False, False, False, True, True, True], so that's True for distances ..., 2, 1, 0

# Second: replacement values that logarithmically put more and more distances into each bucket, up to max_distance.
# This works by mapping the distance onto the range num_token_buckets .. num_buckets

val_if_large = num_token_buckets + (
    torch.log(rel_pos.float() / num_token_buckets)
    / math.log(max_distance / num_token_buckets)
    * (num_buckets - num_token_buckets)
)

# val_if_large (last row; the -inf is log(0) at distance 0, which the is_exact mask overrides below)
# [5.4360, 5.3188, 5.1922, 5.0546, 4.9039, 4.7373, 4.5510, 4.3399, 4.0961, 3.8078, 3.4549, 3.0000, 2.3588, 1.2627,   -inf]

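# long() truncates to an integer (int64)
val_if_large = val_if_large.long()
# Cap at the last bucket so distances >= max_distance cannot index past num_buckets - 1
val_if_large = torch.clamp(val_if_large, max=num_buckets - 1)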

#val_if_large
# [5, 5, 5, 5, 4, 4, 4, 4, 4, 3, 3, 3, 2, 1, -9223372036854775808]]) # The last one is the smallest long int

position_bucket_indices = torch.where(is_exact, rel_pos, val_if_large) # Where is_exact is True, take the value from rel_pos, otherwise from val_if_large

position_bucket_indices  # 0, 1, 2 are always exact; from 3 on we start proper bucketing

# Now we need to turn ALL of these items into positional embeddings






tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 3, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [4, 3, 3, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [4, 4, 3, 3, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [4, 4, 4, 3, 3, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
        [4, 4, 4, 4, 3, 3, 3, 2, 1, 0, 0, 0, 0, 0, 0],
        [4, 4, 4, 4, 4, 3, 3, 3, 2, 1, 0, 0, 0, 0, 0],
        [5, 4, 4, 4, 4, 4, 3, 3, 3, 2, 1, 0, 0, 0, 0],
        [5, 5, 4, 4, 4, 4, 4, 3, 3, 3, 2, 1, 0, 0, 0],
        [5, 5, 5, 4, 4, 4, 4, 4, 3, 3, 3, 2, 1, 0, 0],
        [5, 5, 5, 5, 4, 4, 4, 4, 4, 3, 3, 3, 2, 1, 0]])
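
One way to carry out that last step, sketched under the assumption that we follow the T5 approach of learning one scalar bias per bucket and per attention head and adding it to the attention logits. n_heads, the batch size and the random stand-in tensors are placeholders, not values from the notes above:

```python
import torch
from torch import nn

num_buckets = 6
n_heads = 4            # assumed number of attention heads
q_len, k_len = 15, 15

# Stand-in for position_bucket_indices: any LongTensor of shape
# (q_len, k_len) with values in [0, num_buckets)
bucket_idx = torch.randint(0, num_buckets, (q_len, k_len))

# One learned scalar per (bucket, head)
rel_pos_bias = nn.Embedding(num_buckets, n_heads)

bias = rel_pos_bias(bucket_idx)            # (q_len, k_len, n_heads)
bias = bias.permute(2, 0, 1).unsqueeze(0)  # (1, n_heads, q_len, k_len)

# Attention logits for a batch of 2; random here, normally q @ k^T
logits = torch.randn(2, n_heads, q_len, k_len)
logits = logits + bias                     # broadcasts over the batch dimension
```

Since future positions share bucket 0 with the current token, the bias alone does not hide them; a separate causal mask on the logits is still needed.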