<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#What-is-attention-in-ML?" data-toc-modified-id="What-is-attention-in-ML?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>What is attention in ML?</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Attention-via-NumPy-and-SciPy" data-toc-modified-id="Attention-via-NumPy-and-SciPy-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Attention via <code>NumPy</code> and <code>SciPy</code></a></span></li><li><span><a href="#References" data-toc-modified-id="References-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>References</a></span></li></ul></div>

# Introduction
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-warning">
<font color=black>

**What?** The attention mechanism in machine learning, computed step by step with `NumPy` and `SciPy`.

</font>
</div>

# What is attention in ML?
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-info">
<font color=black>

- An attention-based system can be thought of as having three components:

    - A process that “reads” raw data (such as source words in a source sentence), and converts them into **distributed representations**, with one feature vector associated with each word position. 

    - A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, **without having to visit all of them**.

    - A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the **ability to put attention** on the content of one memory element (or on a few, each with a different weight).


</font>
</div>

# Imports
<hr style = "border:2px solid black" ></hr>

In [13]:
import numpy as np
from scipy.special import softmax

# Attention via `NumPy` and `SciPy`
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-info">
<font color=black>

- Let’s start by defining the **word embeddings** of the four words for which we will calculate attention. 
    
- In practice, these word embeddings would be generated by an **encoder**; for this example we define them manually. 

</font>
</div>

In [3]:
# encoder representations of four different words
word_1 = np.array([1, 0, 0])
word_2 = np.array([0, 1, 0])
word_3 = np.array([1, 1, 0])
word_4 = np.array([0, 0, 1])

<div class="alert alert-info">
<font color=black>

- Next, we generate the weight matrices, which we will multiply with the word embeddings to produce the queries, keys and values. 

- Here the weight matrices are generated randomly; in practice they would **have been learned during training**. 

</font>
</div>

In [7]:
...
# generating the weight matrices
np.random.seed(42) # to allow us to reproduce the same attention values
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

<div class="alert alert-info">
<font color=black>

- The query, key and value vectors for each word are generated by multiplying that word's embedding by the weight matrices `W_Q`, `W_K` and `W_V`, respectively. 
    
</font>
</div>

In [8]:
# generating the queries, keys and values
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V
 
query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V
 
query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V
 
query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V
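
<div class="alert alert-info">
<font color=black>

- The twelve per-word products above can also be written as three matrix products by stacking the embeddings row-wise. The following is a minimal sketch; the names `words`, `Q`, `K` and `V` are assumptions introduced here, and the weight matrices are regenerated with the same seed and call order as the notebook so that they should match.

</font>
</div>

```python
import numpy as np

# Same embeddings as in the cells above, stacked row-wise into a (4, 3) matrix.
words = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])

# Same seed and call order as the notebook.
np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

# One matrix product per projection; row i holds the vector for word i + 1.
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 3)
```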

<div class="alert alert-info">
<font color=black>

- Considering only the first word for the time being, the next step scores its query vector against all of the key vectors using a dot product operation. 

</font>
</div>

In [11]:
# scoring the first query vector against all key vectors
scores = np.array([np.dot(query_1, key_1), np.dot(query_1, key_2),
               np.dot(query_1, key_3), np.dot(query_1, key_4)])
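
<div class="alert alert-info">
<font color=black>

- The four dot products can equivalently be computed as one matrix-vector product between the stacked keys and the query. The sketch below checks that both forms agree; the stacked matrices `words` and `K` are assumptions introduced for illustration.

</font>
</div>

```python
import numpy as np

words = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])

# Same seed and call order as the notebook (W_Q first, then W_K).
np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))

query_1 = words[0] @ W_Q   # query of the first word
K = words @ W_K            # all four keys, one per row

# Element-by-element version, mirroring the cell above.
scores_loop = np.array([np.dot(query_1, key) for key in K])

# Single matrix-vector product: entry j is query_1 . key_j.
scores = K @ query_1

print(np.array_equal(scores, scores_loop))  # True
```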

<div class="alert alert-info">
<font color=black>

- The scores are then passed through a softmax operation to generate the weights. 
    
- Before doing so, it is **common practice** to divide the scores by the square root of the dimensionality of the key vectors (in this case, three). Large dot products push the softmax into regions where its gradients become very small, and the scaling counteracts this. 

</font>
</div>

In [14]:
# computing the weights by a softmax operation
weights = softmax(scores / key_1.shape[0] ** 0.5)
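
<div class="alert alert-info">
<font color=black>

- To see what the scaling does, the sketch below compares the softmax of some raw scores with the softmax of the scaled scores. The score values and the helper `softmax_np` are assumptions made for illustration; the helper is a plain NumPy stand-in for `scipy.special.softmax` on a 1-D array.

</font>
</div>

```python
import numpy as np

def softmax_np(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)          # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

# Illustrative raw scores (an assumption, not read from the cells above).
scores = np.array([8.0, 2.0, 10.0, 2.0])
d_k = 3  # dimensionality of the key vectors in this notebook

unscaled = softmax_np(scores)
scaled = softmax_np(scores / np.sqrt(d_k))

print(unscaled.round(3))
print(scaled.round(3))
```

<div class="alert alert-info">
<font color=black>

- The scaled weights are less sharply peaked: the largest weight shrinks and the smallest grows, while both sets of weights still sum to one.

</font>
</div>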

<div class="alert alert-info">
<font color=black>

- Finally, the attention output is calculated by a weighted sum of all four value vectors. 

</font>
</div>

In [15]:
# computing the attention by a weighted sum of the value vectors
attention = (weights[0] * value_1) + (weights[1] * value_2) + \
    (weights[2] * value_3) + (weights[3] * value_4)

print(attention)

[0.98522025 1.74174051 0.75652026]
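
<div class="alert alert-info">
<font color=black>

- The same steps can be applied to all four words at once in matrix form, $\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$. The following is a minimal sketch under the same seed as the notebook; `words`, `Q`, `K`, `V` and the helper `softmax_rows` are names introduced here. If the weight matrices match those generated earlier, the first row should reproduce the single-word result printed above.

</font>
</div>

```python
import numpy as np

def softmax_rows(x):
    """Row-wise, numerically stable softmax of a 2-D array."""
    shifted = x - x.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

words = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])

# Same seed and call order as the notebook.
np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

Q = words @ W_Q
K = words @ W_K
V = words @ W_V

scores = Q @ K.T                                       # scores[i, j] = query_i . key_j
weights = softmax_rows(scores / np.sqrt(K.shape[1]))   # each row sums to 1
attention = weights @ V                                # one output row per word

print(attention)
```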


# References
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-warning">
<font color=black>

- https://machinelearningmastery.com/what-is-attention/
- https://machinelearningmastery.com/the-attention-mechanism-from-scratch/

</font>
</div>