## An example of the attention mechanism in the transformer model

Presentation of LLMs as given in the lecture. Note that the original code for the [bbs presentation](https://ioskn.github.io/bbs/) is `creating_similarity_matrix_as_in_bbs`. We look at the attention from 'friend' back to 'Emma' in the sentence:


**Emma** hates games but she is a great **friend**

##### How I made the example

<small>
I want to explain the attention mechanism in LLMs and came up with the following story.

An example: Emma hates games but she is a great friend
Imagine the network at a stage where it has to figure out (among other things) the relationships between objects.
This could be done in the following space:
Dim1: Score that the word is a person
Dim2: Score that the word is an animal
Dim3: Score that the word is a noun
Dim4: Score that the word is an adjective
Examples:
Emma: k_1 = (1.2, 0.8, 1.0, −2), this is called a key.
Now we take the word friend. The word itself might be (1.0, 0.9, 1.0, −2), but it is better to ask what friend should look at (persons, animals and adjectives). This is the query q_9 = (1.0, 0.9, 0, 1).
We get to the space via
$k_j = W^K x_j$
$q_i = W^Q x_i$
The similarity between $i$ and $j$ is the dot product between $q_i$ and $k_j$.
For $q_9$ and $k_1$: $q_9 \cdot k_1 = 1.2 \cdot 1.0 + 0.8 \cdot 0.9 + 1.0 \cdot 0 + (-2) \cdot 1$

Can you give me plausible examples for the keys and values of that sentence, besides the one I provided.
</small>
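
The projections $k_j = W^K x_j$ and $q_i = W^Q x_i$ mentioned in the prompt can be sketched as follows. The embedding size `d_model = 6`, the random embeddings `X`, and the random weight matrices `W_K` and `W_Q` are assumptions made purely for illustration. In a trained model these are learned, and the hand-picked keys and queries used in the rest of this notebook stand in for their outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_model, C = 9, 6, 4               # tokens, embedding size (assumed), similarity dimensions
X = rng.normal(size=(d_model, T))     # hypothetical embeddings, one column x_j per token
W_K = rng.normal(size=(C, d_model))   # hypothetical key projection W^K
W_Q = rng.normal(size=(C, d_model))   # hypothetical query projection W^Q

K_demo = W_K @ X                      # column j is k_j = W^K x_j
Q_demo = W_Q @ X                      # column i is q_i = W^Q x_i

# Similarity between the "from" token at position 8 and the "to" token at position 0
sim = Q_demo[:, 8] @ K_demo[:, 0]
print(K_demo.shape, Q_demo.shape, sim)
```

Both projected matrices have shape `(C, T)`, the same layout as the `K` and `Q` matrices built by hand below.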

In [67]:
# Importing the required libraries
import numpy as np
import pandas as pd

# Defining the keys and queries dictionaries
keys = {
    'Emma': np.array([1.2, 0.8, 1.0, 0]),
    'hates': np.array([0, 0, -0.5, 0]),
    'games': np.array([0, 0, 1.0, -1]),
    'but': np.array([0, 0, -1, 0]),
    'she': np.array([1.0, 0.0, 0, -1]),
    'is': np.array([0, 0, -0.5, 0]),
    'a': np.array([0, 0, -1, 0]),
    'great': np.array([0, 0, -1, 1.2]),
    'friend': np.array([0,0, 1.0,0]) # The word friend is not a person or an animal
}

queries = {
    'Emma': np.array([1.1, 0.7, 0.9, 0]),
    'hates': np.array([0, 0, 1, 0]),
    'games': np.array([0, 0, 1, 0]),
    'but': np.array([0, 0, 0, 0]),
    'she': np.array([1.1, 0, 0, -1]),
    'is': np.array([0, 0, 0, 0]),
    'a': np.array([0.2, 0, 1, 0.5]),
    'great': np.array([0.5, 0, 1, 0]),
    'friend': np.array([1, 0.9, 0.5, 1])
}

### Similarity space

We make the comparison in a 4-dimensional space. The dimensions are:
  -  1 Person
  -  2 Animal
  -  3 Noun
  -  4 Adjective

Both the key and the query are represented in this space. The similarity between the key and the query is the dot product between the two vectors.

### The keys K

The keys quantify how strongly a word scores on each dimension. For example, the key for Emma is `(1.2, 0.8, 1.0, 0)`: Emma might be a person, an animal, or a noun, but not an adjective. The keys for all words are collected in the key matrix $K$. Note that our $K$ is a $(C,T)$ matrix, whereas the "Attention Is All You Need" paper uses its transpose, a $(T,C)$ matrix.

In [68]:
df_keys = pd.DataFrame.from_dict(keys, orient='index', columns=['Pers', 'Anim', 'Noun', 'Adj'])
K= df_keys.values.T
print("Shape of K: ", K.shape)
df_keys.T

Shape of K:  (4, 9)


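| | Emma | hates | games | but | she | is | a | great | friend |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Pers | 1.2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Anim | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Noun | 1.0 | -0.5 | 1.0 | -1.0 | 0.0 | -0.5 | -1.0 | -1.0 | 1.0 |
| Adj | 0.0 | 0.0 | -1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 1.2 | 0.0 |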


### The queries Q

Let's look at the last word, "friend". The word itself is a noun and thus has the key `(0, 0, 1.0, 0)`. But we ask what "friend" should look at / attend to (persons, animals and adjectives). This is the query `q_9=(1.0, 0.9, 0.5, 1)`.

In [9]:
# Create a DataFrame to display the queries
df_queries = pd.DataFrame.from_dict(queries, orient='index', columns=['Pers', 'Anim', 'Noun', 'Adj'])
Q = df_queries.values.T
# Note: Q in the "Attention Is All You Need" paper corresponds to our Q.T, a matrix of shape (9, 4)
print("Shape of Q: ", Q.shape)
df_queries.T

Shape of Q:  (4, 9)


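| | Emma | hates | games | but | she | is | a | great | friend |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Pers | 1.1 | 0.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.2 | 0.5 | 1.0 |
| Anim | 0.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.9 |
| Noun | 0.9 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.5 |
| Adj | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.5 | 0.0 | 1.0 |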


#### Attention of friend to Emma
Query friend: `(1.0, 0.9, 0.5, 1)` 
Key Emma: `(1.2, 0.8, 1.0, 0)`
Similarity: `1.2*1.0 + 0.8*0.9 + 1.0*0.5 + 0*1 = 2.42`

In code, the similarity (the unnormalized attention score) of the word "friend" to the word "Emma" is calculated as follows:

In [46]:
K[:,0] @ Q[:,8] # Dot product of the query of the word "friend", Q[:,8], and the key of the word "Emma", K[:,0]

2.42
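
To see how "friend" scores against every word, not just "Emma", the same dot product can be applied to all key columns at once. This is a minimal sketch: the key values are repeated from the table above so the snippet runs on its own, and the names `words`, `K_demo`, and `q_friend` are introduced only for this illustration.

```python
import numpy as np

words = ['Emma', 'hates', 'games', 'but', 'she', 'is', 'a', 'great', 'friend']

# Keys from above, one column per word (rows: Pers, Anim, Noun, Adj)
K_demo = np.array([
    [1.2,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0, 0.0],
    [0.8,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0, 0.0],
    [1.0, -0.5,  1.0, -1.0,  0.0, -0.5, -1.0, -1.0, 1.0],
    [0.0,  0.0, -1.0,  0.0, -1.0,  0.0,  0.0,  1.2, 0.0],
])
q_friend = np.array([1.0, 0.9, 0.5, 1.0])   # query of "friend"

scores = q_friend @ K_demo                  # one similarity per word, shape (9,)

# Print the words sorted by decreasing similarity to the query of "friend"
for word, s in sorted(zip(words, scores), key=lambda pair: -pair[1]):
    print(f"{word:>6}: {s: .2f}")
```

"Emma" comes out on top with the 2.42 computed above, followed by "great".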

### The Attention Matrix

The attention matrix collects the similarities between all queries and all keys. As in the example above, we consider queries $Q_{if}$ with the dimension index $i=1,2,3,4$ and the "from" index $f=1,2,\ldots,T$, and keys $K_{it}$ indexed by the "to" index $t=1,\ldots,T$. The entries of the attention matrix are then given by

$$
\tilde{w}_{ft} = \sum_{i=1}^{4} K_{it} Q_{if}
$$

The result is a $T \times T$ matrix $\tilde{w}$ whose entry $\tilde{w}_{ft}$ is the similarity between the query $f$ (from) and the key $t$ (to). Sums like this can be conveniently written with the Einstein summation convention:

In [65]:
# Computing the similarity matrix
wtilde = np.einsum('it,if->ft', K,Q)
np.round(wtilde,2)

array([[ 2.78, -0.45,  0.9 , -0.9 ,  1.1 , -0.45, -0.9 , -0.9 ,  0.9 ],
       [ 1.  , -0.5 ,  1.  , -1.  ,  0.  , -0.5 , -1.  , -1.  ,  1.  ],
       [ 1.  , -0.5 ,  1.  , -1.  ,  0.  , -0.5 , -1.  , -1.  ,  1.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 1.32,  0.  ,  1.  ,  0.  ,  2.1 ,  0.  ,  0.  , -1.2 ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 1.24, -0.5 ,  0.5 , -1.  , -0.3 , -0.5 , -1.  , -0.4 ,  1.  ],
       [ 1.6 , -0.5 ,  1.  , -1.  ,  0.5 , -0.5 , -1.  , -1.  ,  1.  ],
       [ 2.42, -0.25, -0.5 , -0.5 ,  0.  , -0.25, -0.5 ,  0.7 ,  0.5 ]])
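
The index string `'it,if->ft'` can be spelled out with explicit loops over `f`, `t`, and the summed index `i`. The following sketch uses random stand-in matrices (`K_demo` and `Q_demo` are assumptions for illustration, not the notebook's `K` and `Q`) and checks that the loops, `np.einsum`, and a plain matrix product agree.

```python
import numpy as np

rng = np.random.default_rng(1)
C, T = 4, 9                            # similarity dimensions, number of tokens
K_demo = rng.normal(size=(C, T))       # stand-in keys, shape (C, T)
Q_demo = rng.normal(size=(C, T))       # stand-in queries, shape (C, T)

# Explicit version of sum_i K[i, t] * Q[i, f]
scores_loop = np.zeros((T, T))
for f in range(T):                     # "from" index (query)
    for t in range(T):                 # "to" index (key)
        for i in range(C):             # summed dimension index
            scores_loop[f, t] += K_demo[i, t] * Q_demo[i, f]

scores_einsum = np.einsum('it,if->ft', K_demo, Q_demo)
scores_matmul = Q_demo.T @ K_demo      # same contraction as a matrix product

print(np.allclose(scores_loop, scores_einsum))
print(np.allclose(scores_einsum, scores_matmul))
```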

### In Matrix Multiplication
Note that in our example $Q$ and $K$ are matrices of shape $(C,T)$, where $T$ is the number of tokens and $C$ is the number of dimensions of the similarity space. Often you find the transposed layout $(T,C)$ instead, in which case the attention matrix is written as $Q K^T$.

In [64]:
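print(Q.T @ K)  # Our convention: Q and K have shape (C, T)
Qp = Q.T  # Shape (T, C), as in the papers
Kp = K.T  # Shape (T, C), as in the papers
Qp @ Kp.T  # Q K^T in the papers' notation, same result as above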

[[ 2.78 -0.45  0.9  -0.9   1.1  -0.45 -0.9  -0.9   0.9 ]
 [ 1.   -0.5   1.   -1.    0.   -0.5  -1.   -1.    1.  ]
 [ 1.   -0.5   1.   -1.    0.   -0.5  -1.   -1.    1.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 1.32  0.    1.    0.    2.1   0.    0.   -1.2   0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 1.24 -0.5   0.5  -1.   -0.3  -0.5  -1.   -0.4   1.  ]
 [ 1.6  -0.5   1.   -1.    0.5  -0.5  -1.   -1.    1.  ]
 [ 2.42 -0.25 -0.5  -0.5   0.   -0.25 -0.5   0.7   0.5 ]]


array([[ 2.78, -0.45,  0.9 , -0.9 ,  1.1 , -0.45, -0.9 , -0.9 ,  0.9 ],
       [ 1.  , -0.5 ,  1.  , -1.  ,  0.  , -0.5 , -1.  , -1.  ,  1.  ],
       [ 1.  , -0.5 ,  1.  , -1.  ,  0.  , -0.5 , -1.  , -1.  ,  1.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 1.32,  0.  ,  1.  ,  0.  ,  2.1 ,  0.  ,  0.  , -1.2 ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 1.24, -0.5 ,  0.5 , -1.  , -0.3 , -0.5 , -1.  , -0.4 ,  1.  ],
       [ 1.6 , -0.5 ,  1.  , -1.  ,  0.5 , -0.5 , -1.  , -1.  ,  1.  ],
       [ 2.42, -0.25, -0.5 , -0.5 ,  0.  , -0.25, -0.5 ,  0.7 ,  0.5 ]])

In [56]:
def softmax(z, axis=-1):
    # Subtract the max along the axis for numerical stability (does not change the result)
    z = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z)
    sum_exp_z = np.sum(exp_z, axis=axis, keepdims=True)
    softmax_output = exp_z / sum_exp_z
    return softmax_output

In [57]:
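w = softmax(wtilde)  # Row-wise softmax: the weights of each "from" word sum to 1
# Rows are the queries (from), columns are the keys (to)
pd.DataFrame(np.round(w, 2), index=df_queries.index, columns=df_keys.index)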

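| | Emma | hates | games | but | she | is | a | great | friend |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Emma | 0.61 | 0.02 | 0.09 | 0.02 | 0.11 | 0.02 | 0.02 | 0.02 | 0.09 |
| hates | 0.24 | 0.05 | 0.24 | 0.03 | 0.09 | 0.05 | 0.03 | 0.03 | 0.24 |
| games | 0.24 | 0.05 | 0.24 | 0.03 | 0.09 | 0.05 | 0.03 | 0.03 | 0.24 |
| but | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 |
| she | 0.19 | 0.05 | 0.14 | 0.05 | 0.41 | 0.05 | 0.05 | 0.02 | 0.05 |
| is | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 |
| a | 0.31 | 0.05 | 0.15 | 0.03 | 0.07 | 0.05 | 0.03 | 0.06 | 0.24 |
| great | 0.35 | 0.04 | 0.19 | 0.03 | 0.11 | 0.04 | 0.03 | 0.03 | 0.19 |
| friend | 0.58 | 0.04 | 0.03 | 0.03 | 0.05 | 0.04 | 0.03 | 0.1 | 0.09 |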
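
Subtracting the row maximum before exponentiating leaves the softmax unchanged but avoids overflow for large scores. A small sketch of the difference follows. The function names `softmax_naive` and `softmax_stable` and the input values in `large` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def softmax_naive(z, axis=-1):
    # Exponentiates the raw scores directly
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

def softmax_stable(z, axis=-1):
    # Shifts the scores so the largest one is 0 before exponentiating
    shifted = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

small = np.array([[2.42, -0.25, 0.7]])         # scores of the size used in this example
large = np.array([[1000.0, 1001.0, 1002.0]])   # artificially large scores

same_on_small = np.allclose(softmax_naive(small), softmax_stable(small))

# For large scores np.exp overflows to inf, and inf / inf yields nan
with np.errstate(over='ignore', invalid='ignore'):
    naive_large = softmax_naive(large)
stable_large = softmax_stable(large)

print(same_on_small)
print(naive_large)
print(np.round(stable_large, 3))
```

For the small scores in this notebook both versions give the same weights; the shift only matters once the scores become large.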


In [58]:
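m = Q.shape[0]  # Dimension of the similarity space (C = 4), used for scaling
wtilde = np.einsum('if,it->ft', Q, K)  # Raw scores, rows = from, columns = to
T = wtilde.shape[0]
# Causal mask: a word must not attend to words that come after it
for i in range(0, T):
    for j in range(i + 1, T):
        wtilde[i, j] = -np.inf
w = softmax(wtilde / np.sqrt(m))  # Scaled, masked attention weights
pd.DataFrame(np.round(w, 2), index=df_queries.index, columns=df_keys.index)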

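| | Emma | hates | games | but | she | is | a | great | friend |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Emma | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hates | 0.68 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| games | 0.4 | 0.19 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| but | 0.25 | 0.25 | 0.25 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| she | 0.23 | 0.12 | 0.2 | 0.12 | 0.34 | 0.0 | 0.0 | 0.0 | 0.0 |
| is | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.0 | 0.0 | 0.0 |
| a | 0.27 | 0.11 | 0.19 | 0.09 | 0.13 | 0.11 | 0.09 | 0.0 | 0.0 |
| great | 0.26 | 0.09 | 0.19 | 0.07 | 0.15 | 0.09 | 0.07 | 0.07 | 0.0 |
| friend | 0.3 | 0.08 | 0.07 | 0.07 | 0.09 | 0.08 | 0.07 | 0.13 | 0.12 |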
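
The masking step sets every score above the diagonal to `-inf`, so each word can only attend to itself and to earlier words. The double loop can also be written without loops using `np.triu`. The sketch below uses random stand-in scores (`scores`, `future`, and `w_demo` are names assumed for this illustration) and brings its own `softmax` so that it runs on its own.

```python
import numpy as np

def softmax(z, axis=-1):
    # Row-wise softmax with the max subtracted for numerical stability
    z = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

rng = np.random.default_rng(2)
C, T = 4, 9                              # similarity dimensions, number of tokens
scores = rng.normal(size=(T, T))         # stand-in for the raw (from, to) scores

# True strictly above the diagonal, i.e. where the "to" word comes after the "from" word
future = np.triu(np.ones((T, T), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

w_demo = softmax(masked / np.sqrt(C))    # scaled and masked attention weights
print(np.round(w_demo, 2))
```

Because `exp(-inf)` is 0, the masked positions receive exactly zero weight, and the first row puts all of its weight on the first word, as in the table above.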
