### section 3.3: Attending to different parts of the input with self-attention


Our goal is to build a *context vector*, which can be thought of as an enriched embedding vector. For a given token, we compute its similarity to every token in the input sequence (itself included), normalize those similarities into weights, and take the weighted sum of the input embeddings. That weighted sum is the token's context vector.

In [1]:
import torch

In [2]:
example_text = "Your journey starts with one step"

In [3]:
inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your
     [0.55, 0.87, 0.66], # journey
     [0.57, 0.85, 0.64], # starts
     [0.22, 0.58, 0.33], # with
     [0.77, 0.25, 0.10], # one
     [0.05, 0.80, 0.55]] # step
)

Focusing on the second token, "journey":

In [4]:
query = inputs[1]

In [5]:
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x in enumerate(inputs):
    attn_scores_2[i] = torch.dot(query, x)

After we generate the dot products, we want to normalize the scores. This has 2 benefits:

1. The weights all sum to one.
2. It makes for more stable training - if these values get small and we keep multiplying them with downstream weights, we can lose precision.

In [6]:
attn_scores_2 / attn_scores_2.sum()

tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])

In [7]:
# verify these sum to 1
(attn_scores_2 / attn_scores_2.sum()).sum()

tensor(1.0000)

This is a more naive way to normalize - in practice people tend to use [softmax](https://www.youtube.com/watch?v=KpKog-L9veg), which we can think of as a smooth, differentiable approximation of a max function (strictly speaking, of argmax). That differentiability is helpful when we backpropagate. Let's calculate it ourselves first.

Another benefit is that softmax weights are always positive, so we can view them as probabilities of sorts.
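To make the positivity point concrete, here is a small sketch comparing the two normalizations on scores that include a negative value. The `scores` tensor is made up for illustration and is not from the notebook.

```python
import torch

# Hypothetical scores (not from the notebook) that include a negative value.
scores = torch.tensor([1.0, -0.5, 0.5])

# Dividing by the sum still gives weights that add up to 1,
# but a negative score turns into a negative weight.
sum_normalized = scores / scores.sum()

# Softmax exponentiates first, so every weight is strictly positive.
softmax_weights = torch.softmax(scores, dim=0)

print(sum_normalized)
print(softmax_weights)
```

With embeddings like the ones above (all entries positive) the dot products never go negative, so the issue only shows up once the vectors can have mixed signs.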

In [8]:
def naive_softmax(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

In [9]:
naive_softmax(attn_scores_2)

tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])

Even so, this naive calculation can run into numeric issues: `torch.exp` overflows for large inputs and underflows for very negative ones. It's advisable to always use the torch implementation, which handles these cases and has been optimized for performance. In this scenario they match!

In [10]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print(attn_weights_2)

tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
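The numeric issue is easy to trigger. Below is a minimal sketch with made-up large scores, plus a `stable_softmax` helper (my own name, not part of torch) that uses the common trick of subtracting the maximum before exponentiating. I am assuming this is representative of how stable implementations work rather than quoting torch's internals.

```python
import torch

def naive_softmax(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def stable_softmax(x):
    # Subtracting the max does not change the result mathematically,
    # but it keeps every exponent <= 0 so exp() cannot overflow.
    shifted = x - x.max()
    exps = torch.exp(shifted)
    return exps / exps.sum(dim=0)

# Hypothetical scores, far larger than anything in this notebook.
large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

print(naive_softmax(large_scores))         # exp(1000) overflows in float32 -> nan
print(stable_softmax(large_scores))        # finite, sums to 1
print(torch.softmax(large_scores, dim=0))  # matches the stable version
```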


In [11]:
context_vector_2 = torch.zeros(query.shape)

for i, x in enumerate(inputs):
    context_vector_2 += x * attn_weights_2[i]

print(context_vector_2)


tensor([0.4419, 0.6515, 0.5683])


Now to do this for all tokens. The scores below are not normalized.

In [12]:
attn_scores = torch.empty(6,6)

for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)

print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


For loops are slow. We can achieve the same output using matrix multiplication.

In [13]:
attn_scores = inputs @ inputs.T

attn_scores

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

In [14]:
# normalizing

attn_weights = torch.softmax(attn_scores, dim=1)
attn_weights

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])

In [15]:
# context vectors. Note we calculated row 2 by itself above

context_vectors = attn_weights @ inputs
context_vectors

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
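As a sanity check, the whole of this section can be wrapped in one small function and compared against the loop version for "journey". The function name `simple_self_attention` is mine, chosen for illustration.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your
     [0.55, 0.87, 0.66], # journey
     [0.57, 0.85, 0.64], # starts
     [0.22, 0.58, 0.33], # with
     [0.77, 0.25, 0.10], # one
     [0.05, 0.80, 0.55]] # step
)

def simple_self_attention(x):
    # Pairwise dot products, softmax over each row, weighted sum of inputs.
    scores = x @ x.T
    weights = torch.softmax(scores, dim=-1)
    return weights @ x, weights

context_vectors, attn_weights = simple_self_attention(inputs)

# Loop version for the second token only, as done earlier in the section.
query = inputs[1]
scores_2 = torch.stack([torch.dot(query, x) for x in inputs])
weights_2 = torch.softmax(scores_2, dim=0)
loop_context = torch.zeros(query.shape)
for i, x in enumerate(inputs):
    loop_context += weights_2[i] * x

print(attn_weights.sum(dim=-1))                          # each row should sum to 1
print(torch.allclose(context_vectors[1], loop_context))  # matmul vs loop
```

Using `dim=-1` normalizes across each row, so every token gets its own set of weights that sums to one.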

### section 3.4: Implementing self-attention with trainable weights

Also known as scaled dot-product attention. This is critical to the transformer architecture. Building off of the simple attention we completed in section 3.3, we still want to generate context vectors that are weighted sums over the inputs, one for each input token.

By introducing trainable weight matrices, we let the model learn projections that produce "better" context vectors.

In [16]:
x_2 = inputs[1]
d_in = inputs.shape[1]
d_out = 2

Normally the dimensions in and out are the same, but here we are using different sizes to better illustrate the work.

In [17]:
# initialize the query, key, and value weight matrices
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

`requires_grad` is set to `False` to limit clutter in the following outputs. It needs to be `True` once we start actually training.

In [18]:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

print(query_2)

tensor([0.4306, 1.4551])


In [19]:
keys = inputs @ W_key
values = inputs @ W_value
print(f"keys shape: {keys.shape}")
print(f"values shape: {values.shape}")

keys shape: torch.Size([6, 2])
values shape: torch.Size([6, 2])


In [20]:
keys_2 = keys[1]
attn_score_22 = query_2.dot(keys_2)
attn_score_22

tensor(1.8524)

In [21]:
attn_score_2 = query_2 @ keys.T
attn_score_2

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])

Before applying softmax, we scale the attention scores by dividing them by the square root of the key dimension `d_k`. This helps us avoid large values going into the softmax function, which would produce small gradients and can lead to slow or stagnated training.

In [22]:
# normalizing

d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_score_2 / (d_k ** 0.5), dim=-1)
attn_weights_2

tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])
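With `d_k = 2` the effect of the scaling is hard to see, so here is a rough sketch with a larger, made-up key dimension and random vectors (none of these values come from the notebook). Dot products of random vectors grow with the dimension, and the unscaled softmax tends to put almost all of its mass on one token.

```python
import torch

torch.manual_seed(0)

d_k = 64                       # hypothetical key dimension
query = torch.randn(d_k)       # random stand-in for a query vector
keys = torch.randn(6, d_k)     # random stand-ins for six key vectors

scores = keys @ query

unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / d_k**0.5, dim=-1)

# Dividing the scores by a constant > 1 flattens the distribution,
# so the largest weight can only shrink (or stay the same).
print(unscaled.max(), scaled.max())
```

A flatter distribution keeps softmax away from its saturated region, which is where the gradients become very small.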

In [23]:
context_vec_2 = attn_weights_2 @ values
context_vec_2

tensor([0.3061, 0.8210])

So *really* what we have are 3 weight matrices that project our input tokens into three different roles, and these roles act like the parts of a database lookup.

* query - is the thing we're looking for. It comes from whatever token we're focused on in a given moment.
* key - is what each token exposes to be matched against the query.
* value - is what that key retrieves.

So the "DB" representation still needs to be learned. We transform each token into these representations and then weigh the values by their attention weights (the softmax of the scaled dot products between the query and the keys).
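The database analogy can be pushed to its extreme with a toy example. The keys and values below are invented for illustration: the keys are far apart, so the softmax is nearly one-hot and attention behaves almost like an exact dictionary lookup. In a trained model the match is softer, and the result is a blend of several values.

```python
import torch

# Hypothetical toy "database": three well-separated keys and their values.
keys = 10.0 * torch.eye(3)
values = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [5.0, 5.0]])

# A query that matches the second key exactly.
query = keys[1]

scores = query @ keys.T                  # large score for key 1, zero for the rest
weights = torch.softmax(scores, dim=-1)  # nearly one-hot on key 1
retrieved = weights @ values             # approximately values[1]

print(weights)
print(retrieved)
```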

### section 3.4.2: Implementing a compact self-attention Python class

We can do all of the above steps succinctly using a Python class. 

In [24]:
from utils import SelfAttention_v1

In [25]:
torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))

tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)
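The `utils` module isn't shown in these notes. As a rough sketch of what such a class might look like, here is a version that simply packages the steps from cells 17 to 23. The class name `SelfAttentionV1Sketch` is hypothetical, and I am assuming the parameters are created with `torch.rand` in query, key, value order as in the cells above.

```python
import torch
import torch.nn as nn

class SelfAttentionV1Sketch(nn.Module):
    """Hypothetical stand-in for utils.SelfAttention_v1 (assumed structure)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        # Trainable projection matrices, one per role.
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        queries = x @ self.W_query
        keys = x @ self.W_key
        values = x @ self.W_value
        # Scaled dot-product attention over all tokens at once.
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55]]
)

torch.manual_seed(123)
sa = SelfAttentionV1Sketch(d_in=3, d_out=2)
out = sa(inputs)
print(out.shape)
```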


It makes more sense to use the built-in `Linear` layers - they have a more sophisticated initialization scheme.

In [26]:
from utils import SelfAttention_v2
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)
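Again, `utils` isn't shown here, so the following is only a sketch of how a `Linear`-based version might be written. The class name `SelfAttentionV2Sketch` is hypothetical, and `bias=False` is my assumption: it is consistent with Exercise 3.1, where copying only the weights is enough to reproduce the same output.

```python
import torch
import torch.nn as nn

class SelfAttentionV2Sketch(nn.Module):
    """Hypothetical stand-in for utils.SelfAttention_v2 (assumed structure)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        # nn.Linear without a bias is just a matrix multiplication,
        # but it comes with its own weight initialization.
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55]]
)

torch.manual_seed(789)
sa = SelfAttentionV2Sketch(d_in=3, d_out=2)
out = sa(inputs)
print(out.shape)
```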


### Exercise 3.1

To check that the 2 implementations otherwise behave the same, we copy the weights of one into the other. `nn.Linear` stores its weight with shape `(d_out, d_in)`, so we transpose before assigning.

In [32]:
sa_v1.W_key = torch.nn.Parameter(sa_v2.W_key.weight.T)
sa_v1.W_query = torch.nn.Parameter(sa_v2.W_query.weight.T)
sa_v1.W_value = torch.nn.Parameter(sa_v2.W_value.weight.T)


In [33]:
sa_v1(inputs)

tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)
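The transpose in the exercise comes from how `nn.Linear` lays out its weight. A quick sketch with a throwaway layer and random input (both made up for illustration) shows the shape and the equivalent matrix multiplication:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(3, 2, bias=False)  # in_features=3, out_features=2
x = torch.rand(6, 3)                  # hypothetical batch of six 3-dim tokens

print(linear.weight.shape)            # torch.Size([2, 3]), i.e. (out, in)

# The layer computes x @ weight.T, so a plain (d_in, d_out) matrix
# needs the transposed weight to give the same result.
manual = x @ linear.weight.T
print(torch.allclose(linear(x), manual))
```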

### section 3.5: Hiding future words with causal attention

So it's all fine and dandy to build attention with every other token in a sequence, but it's not helpful when what we're trying to build is a model that can generate the next token: at prediction time the future tokens aren't available. So here we will add in a mechanism that masks out the future tokens.