### Section 3.2: Capturing data dependencies with attention mechanisms


Our goal is to build a *context vector*, which can be thought of as an enriched embedding vector. We take a token (the query), compute its similarity to every token in the input sequence (including itself), and use those similarities as weights to combine all the token embeddings into a new vector.

In [44]:
import torch

In [36]:
example_text = "Your journey starts with one step"

In [46]:
inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your
     [0.55, 0.87, 0.66], # journey
     [0.57, 0.85, 0.64], # starts
     [0.22, 0.58, 0.33], # with
     [0.77, 0.25, 0.10], # one
     [0.05, 0.80, 0.55]] # step
)

Focusing on the second token, "journey", as the query:

In [48]:
query = inputs[1]

In [51]:
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x in enumerate(inputs):
    attn_scores_2[i] = torch.dot(query, x)
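
The loop above can also be written as a single matrix-vector product, since each score is just one row of `inputs` dotted with `query`. Here is a minimal sketch; the names `attn_scores_loop` and `attn_scores_vec` are my own, used only to compare the two versions.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
query = inputs[1]

# Loop version: one dot product per input token
attn_scores_loop = torch.empty(inputs.shape[0])
for i, x in enumerate(inputs):
    attn_scores_loop[i] = torch.dot(query, x)

# Vectorized version: (6, 3) @ (3,) -> (6,)
attn_scores_vec = inputs @ query

# Check that both versions agree
print(torch.allclose(attn_scores_loop, attn_scores_vec))
```

The vectorized form avoids the Python loop, which should matter more as the sequence gets longer.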

After we generate the dot products, we want to normalize the scores. This has two benefits:

1. The weights all sum to one.
2. It makes for more stable training: if these values get small and we keep multiplying them with downstream weights, we can lose precision.

In [53]:
attn_scores_2 / attn_scores_2.sum()

tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])

In [54]:
# verify these sum to 1
(attn_scores_2 / attn_scores_2.sum()).sum()

tensor(1.0000)

This is a naive way to normalize. In practice people tend to use [softmax](https://www.youtube.com/watch?v=KpKog-L9veg), which we can think of as a smooth, differentiable stand-in for the max function. That is helpful when we backpropagate. Let's calculate it ourselves first.

Another benefit of softmax is that the weights always end up positive, so we can view them as probabilities of a sort.

In [55]:
def naive_softmax(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

In [56]:
naive_softmax(attn_scores_2)

tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])

That said, this naive calculation can run into numerical issues (overflow or underflow) with very large or very small inputs, so it's advisable to always use the torch implementation, which has been optimized for performance. In this scenario they match!
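
To make the numerical issue concrete, here is a small sketch comparing the naive version with a variant that subtracts the maximum before exponentiating. The `stable_softmax` helper and the `large` test tensor are my own illustrations, and the max-subtraction trick is a common remedy rather than something taken from these notes.

```python
import torch

def naive_softmax(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def stable_softmax(x):
    # Subtracting the max does not change the result mathematically,
    # but it keeps every exponent <= 0 so exp() cannot overflow.
    shifted = x - x.max()
    return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)

# Hypothetical large scores, chosen so that exp() overflows in float32
large = torch.tensor([1000.0, 1001.0, 1002.0])

print(naive_softmax(large))         # overflow: inf / inf yields nan
print(stable_softmax(large))        # finite weights that sum to 1
print(torch.softmax(large, dim=0))  # agrees with the stable version
```

For the small scores used in this section the two versions give the same answer; the difference only shows up once the inputs get large.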

In [58]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print(attn_weights_2)

tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])


In [60]:
# The context vector is the attention-weighted sum of all input vectors
context_vector_2 = torch.zeros(query.shape)

for i, x in enumerate(inputs):
    context_vector_2 += x * attn_weights_2[i]

print(context_vector_2)


tensor([0.4419, 0.6515, 0.5683])


Now to do this for all tokens, so that every input gets its own context vector.
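
One way to sketch this, assuming we simply repeat the three steps above (dot products, softmax, weighted sum) with every token as the query, is to replace the loops with matrix multiplications. The names `attn_scores`, `attn_weights`, and `all_context_vectors` are my own choices here.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)

# Row i holds the dot products of token i (as query) with every token
attn_scores = inputs @ inputs.T                    # shape (6, 6)

# Normalize each row so the weights for a given query sum to 1
attn_weights = torch.softmax(attn_scores, dim=-1)  # shape (6, 6)

# Each context vector is a weighted sum of all input vectors
all_context_vectors = attn_weights @ inputs        # shape (6, 3)

# Row 1 should match context_vector_2 computed above for "journey"
print(all_context_vectors[1])
```

Using `dim=-1` normalizes along the last dimension, so each row of `attn_weights` corresponds to one query token.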