## Title 
Exercise: Self-Attention

## Description
In this exercise, you will implement a Self-Attention Head for the **3rd word** of a 4-word input. From a pickled file, we load the `query`, `key`, and `value` vectors that correspond to 4 different inputs (thus, a total of 12 vectors). Specifically, these are loaded into 3 respective `dict`s.

You only need to calculate the final `z` "context" vector that corresponds to the 3rd word (i.e., `z2`).
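
As a rough summary of what the cells in this exercise compute, the three steps (scaled scores, softmax, weighted sum) can be written as one formula. The symbol $\alpha_{2j}$ for the softmax weight is notation assumed here for illustration and is not used elsewhere in the exercise; $q_2$, $k_j$, and $v_j$ stand for `queries[2]`, `keys[j]`, and `values[j]`, and $d_k$ is the length of a key vector.

$$
\alpha_{2j} = \frac{\exp\left(q_2 \cdot k_j / \sqrt{d_k}\right)}{\sum_{m=0}^{3} \exp\left(q_2 \cdot k_m / \sqrt{d_k}\right)},
\qquad
z_2 = \sum_{j=0}^{3} \alpha_{2j} \, v_j
$$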

#### <font color="red">**REMINDER**</font>: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.

In [2]:
# imports useful libraries
import math
import pickle
import numpy as np

<font color="red">YOU DO NOT NEED TO EDIT THE CELL BELOW</font>

The following cells generate the `queries`, `keys`, and `values` vectors that correspond to 4 distinct words, save them to `L25.p`, and load them back. Here, all vectors have a length of 25. Each of these variables (e.g., `queries`) is a `dict`, indexed by the word number. For example, `queries[2]` corresponds to the 3rd word (the indices are 0-based), and its value is a list of length 25, which is the actual _query_ vector.

In [3]:
# Create random queries, keys, values for 4 words, each of length 25
queries = {i: np.random.randn(25).tolist() for i in range(4)}
keys = {i: np.random.randn(25).tolist() for i in range(4)}
values = {i: np.random.randn(25).tolist() for i in range(4)}

# Save to L25.p
with open("L25.p", "wb") as f:
    pickle.dump([queries, keys, values], f)

print("L25.p file has been created successfully.")

L25.p file has been created successfully.


In [4]:
# load the three dicts back from the pickle file, closing the file afterwards
with open("L25.p", "rb") as f:
    queries, keys, values = pickle.load(f)

# to illustrate, let's print the query, key, and value vectors that correspond to the 3rd word in the sentence:
print("query vector for 3rd word:", queries[2])
print("\nkey vector for 3rd word:", keys[2])
print("\nvalue vector for 3rd word:", values[2])

query vector for 3rd word: [0.31255341745661447, 0.5444790208080438, 0.8632430499129878, -1.051515695088019, 0.724062761084608, 1.7988668525358067, -2.335102779369179, -0.7719423128312118, -0.9059941854398721, 0.8248369346074708, 0.631856726562657, -0.5425496355876188, 0.2436146146419746, 0.5162508478350442, 1.2239919241719166, -1.0909014329506188, -0.3726517089845168, 2.239444930344251, 1.157228796616806, -0.9081594626858468, -0.5572001524430997, -1.7401066957543887, 0.387705323294313, -0.3395043240436249, -1.268354233503594]

key vector for 3rd word: [-0.9008351905899606, 0.38919556285089085, 0.7124181581634611, 1.3423880876814092, -0.7406983167245698, 0.6855559198309097, 0.4734445175347144, -0.240820312768575, 0.44226398751729934, -1.6197564972256955, 0.11427482289977442, -0.9939044298094357, 0.051007671999343945, 0.34562879601223684, 1.4798319833963203, -1.7398377320192, -0.8878764079461707, -2.143627243195943, -0.9348044579989998, -0.35917632890486406, -0.08221775624778214, -1.732

<font color="red">YOU DO NOT NEED TO EDIT THE CELL BELOW</font>

In [5]:
# returns the dot product of two passed-in vectors
def calculate_dot_product(v1, v2):
    return sum(a*b for a, b in zip(v1, v2))

In the cell below, populate the `self_attention_scores` list by calculating the **four** Attention scores that correspond to the **3rd word**. Each Attention score is the dot product of the 3rd word's query vector with one word's key vector, divided by $\sqrt{d_k}$ (where $d_k$ represents the length of the key vector). The ordering should be natural. That is, the 1st item in `self_attention_scores` should correspond to the Attention score for the 1st word, the 2nd item should correspond to the Attention score for the 2nd word, and so on.

In [6]:
### edTest(test_a) ###
self_attention_scores = []

# YOUR CODE HERE
scaling_factor = math.sqrt(len(keys[0]))

# Query for the 3rd word
query_vector = queries[2]

# Loop through all 4 words and compute attention scores
for i in range(len(keys)):
    score = calculate_dot_product(query_vector, keys[i]) / scaling_factor
    self_attention_scores.append(score)

self_attention_scores

[0.3163311366417093,
 -2.006985897795121,
 0.22941007018224074,
 -0.8547962628439226]
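
As a cross-check on the loop above, the same scaled scores can be computed in one matrix-vector product. The sketch below is not part of the graded exercise; the names `toy_queries`, `toy_keys`, and `scaled_scores` are hypothetical, and the toy data (3 words, vectors of length 4) is made up so the result can be verified by hand.

```python
import math
import numpy as np

# Hypothetical toy data: 3 words with vectors of length 4
# (the exercise itself uses 4 words with vectors of length 25).
toy_queries = {0: [1.0, 0.0, 0.0, 0.0],
               1: [0.0, 1.0, 0.0, 0.0],
               2: [1.0, 1.0, 0.0, 0.0]}
toy_keys = {0: [2.0, 0.0, 0.0, 0.0],
            1: [0.0, 4.0, 0.0, 0.0],
            2: [2.0, 2.0, 2.0, 2.0]}

def scaled_scores(query, keys_dict):
    """Return the dot product of `query` with every key, divided by sqrt(d_k)."""
    # Stack the keys in word order so that row i is the key of word i.
    K = np.array([keys_dict[i] for i in range(len(keys_dict))])  # shape (n_words, d_k)
    q = np.array(query)                                          # shape (d_k,)
    return (K @ q) / math.sqrt(K.shape[1])

# Scores for the 3rd word (index 2): dot products are 2, 4, 4 and sqrt(4) = 2.
toy_scores = scaled_scores(toy_queries[2], toy_keys)
print(toy_scores)
```

With the exercise data, `scaled_scores(queries[2], keys)` should agree with `self_attention_scores` up to floating-point rounding.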

In the cell below, populate the `softmax_scores` list by calculating the softmax of each of the four Attention scores found in the previous cell.

In [7]:
### edTest(test_b) ###
softmax_scores = []

# YOUR CODE HERE
# exponentiate each score
exp_scores = [math.exp(score) for score in self_attention_scores]

# sum of exponentiated scores
total = sum(exp_scores)

# normalize each score
softmax_scores = [exp_score / total for exp_score in exp_scores]

softmax_scores

[0.4301602867829034,
 0.04213340376804944,
 0.3943492084956433,
 0.13335710095340383]
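
The direct formula above works well for scores of this size, but `math.exp` raises `OverflowError` for very large inputs. A common variant, sketched below, subtracts the maximum score before exponentiating; this leaves the result unchanged mathematically because the shift cancels in the ratio. The function name `stable_softmax` and the example inputs are hypothetical and not part of the exercise.

```python
import math

def stable_softmax(scores):
    """Softmax that subtracts the maximum score first to avoid overflow."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores: the two equal scores receive equal weight.
toy_probs = stable_softmax([1.0, 2.0, 2.0])

# math.exp(1000.0) would overflow, but the shifted version does not.
large_probs = stable_softmax([1000.0, 1000.0])

print(toy_probs, large_probs)
```

For the scores in this exercise, `stable_softmax(self_attention_scores)` should match `softmax_scores` up to floating-point rounding.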

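In the cell below, create the final `z2` list that corresponds to the 3rd word. Each entry of `z2` is the sum, over all 4 words, of that word's softmax score multiplied by the matching entry of its value vector, so `z2` should have a length of 25.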

In [8]:
### edTest(test_c) ###
z2 = []

# YOUR CODE HERE
for dim in range(len(values[0])):  
    weighted_sum = 0
    # Sum over all 4 words
    for i in range(len(values)):  
        weighted_sum += softmax_scores[i] * values[i][dim]
    z2.append(weighted_sum)

z2

[0.4988362061113303,
 1.0399374204971337,
 0.010786855250006722,
 -0.505420903537367,
 0.16399025809114082,
 0.0951801722193798,
 -0.6380246648103174,
 -0.372752786449124,
 -0.45656799324064706,
 0.1762284678524646,
 0.3443050534150157,
 -0.071734330410285,
 -0.5963679511335237,
 0.1144371536018865,
 -0.4253568999284065,
 -0.162668014617458,
 0.400086978244591,
 0.9843709691966013,
 0.43310929900201445,
 0.35917217039833244,
 0.1591683180161405,
 0.7356985907661435,
 0.12465708738832049,
 0.9934073233690177,
 -0.7258839098784635]
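
The nested loop that builds `z2` is a weighted sum of the value vectors, which can also be expressed as a single vector-matrix product. The sketch below illustrates this on made-up data chosen so the answer is easy to check by hand; `toy_weights`, `toy_values`, and `context_vector` are hypothetical names, not part of the exercise.

```python
import numpy as np

# Hypothetical softmax weights for 3 words (they sum to 1).
toy_weights = [0.5, 0.25, 0.25]

# Hypothetical value vectors of length 2, indexed by word number.
toy_values = {0: [2.0, 0.0],
              1: [0.0, 4.0],
              2: [4.0, 4.0]}

def context_vector(weights, values_dict):
    """Return the weighted sum of the value vectors as a plain list."""
    # Stack the values in word order so that row i belongs to word i.
    V = np.array([values_dict[i] for i in range(len(values_dict))])  # shape (n_words, d_v)
    return (np.array(weights) @ V).tolist()

# 0.5*[2, 0] + 0.25*[0, 4] + 0.25*[4, 4] = [2, 2]
toy_z = context_vector(toy_weights, toy_values)
print(toy_z)
```

With the exercise data, `context_vector(softmax_scores, values)` should reproduce `z2` up to floating-point rounding.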