# The Attention mechanism and its implementation in Keras

Computing self-attention of a sentence with GloVe embeddings and the `MultiHeadAttention` class in Keras

Documents used to write the notebook:
* Chollet's book, _Deep Learning with Python_, Second Edition, 2021, pp. 339-342
* The Keras documentation: https://keras.io/api/layers/attention_layers/multi_head_attention/#multiheadattention-layer
* TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention
* Implementation: https://github.com/keras-team/keras/blob/v2.7.0/keras/layers/multi_head_attention.py

Author: Pierre Nugues

## Modules

In [1]:
import numpy as np
import numpy.linalg
from scipy.special import softmax
import tensorflow as tf

## Noncontextual embeddings

We load GloVe

In [2]:
def load(file):
    """
    Return the embeddings in the form of a dictionary
    :param file: path to a GloVe text file
    :return: a dictionary mapping each word to its vector
    """
    embeddings = {}
    # Each line holds a word followed by the components of its vector
    with open(file, encoding='utf-8') as glove:
        for line in glove:
            values = line.strip().split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings
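
The parsing step can be tried without the GloVe file. The sketch below assumes the layout read by `load()`: one word per line, followed by its vector components separated by spaces. The two lines and their values are made up for illustration and the name `toy_embeddings` is hypothetical.

```python
import io
import numpy as np

# Two made-up lines in the layout assumed above:
# a word followed by its vector components, separated by spaces
toy_file = io.StringIO("ship 0.1 0.2 0.3\ncrew 0.4 0.5 0.6\n")

toy_embeddings = {}
for line in toy_file:
    values = line.strip().split()
    # The first field is the word, the remaining fields are the vector
    toy_embeddings[values[0]] = np.array(values[1:], dtype='float32')
```

Each value of the dictionary is a one-dimensional `float32` array, here of length 3 instead of 50.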

In [3]:
embedding_file = '/Users/pierre/Documents/Cours/EDAN20/corpus/glove.6B.50d.txt'
embeddings_dict = load(embedding_file)

In [4]:
embeddings_dict['ship']

array([ 1.5213  ,  0.10522 ,  0.38162 , -0.50801 ,  0.032423, -0.13484 ,
       -1.2474  ,  0.79813 ,  0.84691 , -1.101   ,  0.88743 ,  1.3749  ,
        0.42928 ,  0.65717 , -0.2636  , -0.41759 , -0.48846 ,  0.91061 ,
       -1.7158  , -0.438   ,  0.78395 ,  0.19636 , -0.40657 , -0.53971 ,
        0.82442 , -1.7434  ,  0.14285 ,  0.28037 ,  1.1688  ,  0.16897 ,
        2.2271  , -0.58273 , -0.45723 ,  0.62814 ,  0.54441 ,  0.28462 ,
        0.44485 , -0.55343 , -0.36493 , -0.016425,  0.40876 , -0.87148 ,
        1.5513  , -0.80704 , -0.10036 , -0.28461 , -0.33216 , -0.50609 ,
        0.48272 , -0.66198 ], dtype=float32)

## Cosine similarity

Let us compute the cosine similarity of the words in a sentence:
> I must go back to my ship and to my crew

_Odyssey_, book I 

Remember that:
$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \cdot ||\mathbf{v} ||}$$
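
As a quick check of the formula, here is a minimal sketch on two toy vectors. The vectors and the helper name `cosine` are chosen for illustration only and are not GloVe values.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors, chosen only for illustration
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

cos_uv = cosine(u, v)   # 1 / sqrt(2): the vectors are 45 degrees apart
cos_uu = cosine(u, u)   # a vector has a cosine of 1 with itself
```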

In [5]:
sentence = 'I must go back to my ship and to my crew'

In [6]:
words = sentence.lower().split()
words

['i', 'must', 'go', 'back', 'to', 'my', 'ship', 'and', 'to', 'my', 'crew']

We build the embedding matrix

In [7]:
embeddings_seq = np.array([embeddings_dict[word] for word in words])
#embeddings_seq

We compute the attention scores as the pairwise cosines of the word embeddings

In [8]:
attn_scores_cos = np.zeros((len(words),len(words)))
for i in range(len(words)):
    scores = np.zeros(len(words))
    for j in range(len(words)):
        scores[j] = (np.dot(embeddings_seq[i], embeddings_seq[j])/
                     (np.linalg.norm(embeddings_seq[i]) * 
                      np.linalg.norm(embeddings_seq[j])))
        #scores[j] = np.dot(embeddings_dict[words[i]], embeddings_dict[words[j]])
    attn_scores_cos[i] = scores
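
The double loop can also be written as a single matrix product once the rows are scaled to unit length. This is a sketch of an alternative, not the code used in the rest of the notebook. The function name `pairwise_cosine` and the small matrix `X_toy` are assumptions made for the example.

```python
import numpy as np

def pairwise_cosine(X):
    """Cosine similarity between all pairs of rows of X, shape (n, d)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # shape (n, 1)
    X_unit = X / norms                                # unit-length rows
    return X_unit @ X_unit.T                          # shape (n, n)

# A small stand-in for the embedding matrix: 3 "words", dimension 4
X_toy = np.array([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.5, 1.0, 1.0],
                  [0.0, 1.0, 1.0, 1.0]])
cos_toy = pairwise_cosine(X_toy)
```

The result is symmetric with ones on the diagonal, like the table printed from `attn_scores_cos`.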

In [9]:
print('\t', end='')
for i in range(len(words)):
    print(words[i], end='\t')
print()

for i in range(attn_scores_cos.shape[0]):
    print(words[i], end='\t')
    for j in range(attn_scores_cos.shape[1]):
        print(f"{attn_scores_cos[i,j]:.2f}", end='\t')
    print()

	i	must	go	back	to	my	ship	and	to	my	crew	
i	1.00	0.75	0.86	0.76	0.73	0.90	0.35	0.65	0.73	0.90	0.42	
must	0.75	1.00	0.85	0.68	0.87	0.69	0.42	0.69	0.87	0.69	0.45	
go	0.86	0.85	1.00	0.84	0.84	0.81	0.41	0.68	0.84	0.81	0.49	
back	0.76	0.68	0.84	1.00	0.83	0.76	0.49	0.77	0.83	0.76	0.51	
to	0.73	0.87	0.84	0.83	1.00	0.68	0.54	0.86	1.00	0.68	0.51	
my	0.90	0.69	0.81	0.76	0.68	1.00	0.38	0.63	0.68	1.00	0.44	
ship	0.35	0.42	0.41	0.49	0.54	0.38	1.00	0.46	0.54	0.38	0.78	
and	0.65	0.69	0.68	0.77	0.86	0.63	0.46	1.00	0.86	0.63	0.49	
to	0.73	0.87	0.84	0.83	1.00	0.68	0.54	0.86	1.00	0.68	0.51	
my	0.90	0.69	0.81	0.76	0.68	1.00	0.38	0.63	0.68	1.00	0.44	
crew	0.42	0.45	0.49	0.51	0.51	0.44	0.78	0.49	0.51	0.44	1.00	


## Contextual embeddings

We design a new vector representation for _ship_ so that it is influenced by _crew_ and the other words of its context. This influence will depend on the embeddings from the context. Let us use the cosine similarities as attention scores.

In [10]:
attn_scores_cos[6]

array([0.34663907, 0.41782767, 0.40681112, 0.48531651, 0.54014385,
       0.3791028 , 1.        , 0.45863339, 0.54014385, 0.3791028 ,
       0.78480232])

We compute the new embedding as the sum of the noncontextual embeddings weighted by the cosine similarities, rounded here to two decimals. The result is a contextual embedding.

In [11]:
new_embeddings_ship = (0.35 * embeddings_dict['i'] + 
                  0.42 * embeddings_dict['must'] + 
                  0.41 * embeddings_dict['go'] +
                  0.49 * embeddings_dict['back'] +
                  0.54 * embeddings_dict['to'] + 
                  0.38 * embeddings_dict['my'] +
                  1.00 * embeddings_dict['ship'] +
                  0.46 * embeddings_dict['and'] +
                  0.54 * embeddings_dict['to'] +
                  0.38 * embeddings_dict['my'] +
                  0.78 * embeddings_dict['crew'])
new_embeddings_ship

array([  3.2289004 ,   0.6421813 ,   1.4712307 ,  -2.3537598 ,
         2.24136   ,  -0.42374972,  -4.105233  ,   2.6215937 ,
         0.17187847,  -2.4323788 ,   1.3882339 ,   3.7241364 ,
        -1.9721073 ,   1.1893367 ,   2.2511206 ,   0.9501926 ,
        -0.76461965,   1.0288985 ,  -3.0553396 ,  -3.6306143 ,
         0.8304751 ,   2.9298651 ,   1.3221488 ,  -0.70915157,
         2.9745216 , -10.595905  ,  -1.3167882 ,   0.20589754,
         3.5456927 ,  -2.7711318 ,  18.2672    ,   2.4816926 ,
        -3.588689  ,   0.32967418,   1.2717707 ,   0.653944  ,
         1.5873263 ,   0.01946718,   0.7724056 ,  -1.4620132 ,
        -0.2066631 ,  -1.2463707 ,   2.1504393 ,  -0.18107067,
        -0.5025929 ,  -0.2888131 ,  -0.5059958 ,  -1.9675692 ,
        -0.06049497,  -0.6725442 ], dtype=float32)

The exact computation with NumPy, using the unrounded scores:

In [12]:
(attn_scores_cos @ embeddings_seq)[6]

array([ 3.23191333e+00,  6.40820291e-01,  1.47175971e+00, -2.34335986e+00,
        2.23580736e+00, -4.18774560e-01, -4.10024511e+00,  2.62113565e+00,
        1.80098590e-01, -2.43597248e+00,  1.39229628e+00,  3.71878745e+00,
       -1.96033551e+00,  1.19803266e+00,  2.23935332e+00,  9.37625410e-01,
       -7.70491710e-01,  1.03488285e+00, -3.06148983e+00, -3.62586930e+00,
        8.34011570e-01,  2.92812823e+00,  1.31648467e+00, -7.13029835e-01,
        2.96666448e+00, -1.05669432e+01, -1.30994248e+00,  2.02828896e-01,
        3.53620975e+00, -2.75710038e+00,  1.82203369e+01,  2.46983156e+00,
       -3.58043840e+00,  3.26042563e-01,  1.27600905e+00,  6.57010953e-01,
        1.58887761e+00,  1.15708439e-02,  7.66195274e-01, -1.45595292e+00,
       -2.03622785e-01, -1.24835755e+00,  2.15496249e+00, -1.87666416e-01,
       -5.02529457e-01, -2.91283600e-01, -5.10062909e-01, -1.95960099e+00,
       -5.88529337e-02, -6.73801397e-01])
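
The hand-written sum and the matrix product agree up to the rounding of the weights, because a weighted sum of the rows of a matrix is a vector-matrix product. A minimal sketch with made-up weights and vectors (none of them taken from GloVe):

```python
import numpy as np

weights = np.array([0.5, 1.0, 0.25])   # hypothetical attention scores
X_toy = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [4.0, 4.0]])         # three toy "embeddings" of dimension 2

# Explicit weighted sum, one term per word
explicit = sum(w * x for w, x in zip(weights, X_toy))
# The same result as a vector-matrix product
product = weights @ X_toy
```

Both expressions give `[1.5, 3.0]`. Applying the product to a whole matrix of scores yields one contextual vector per row.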

## Self-attention

Vaswani et al. (2017) defined attention as:
$$
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}(\frac{\mathbf{Q}  \mathbf{K}^\intercal}{\sqrt{d_k}})  \mathbf{V},
$$
where
$$
\begin{array}{lcl}
\mathbf{Q} &=& \mathbf{X} \mathbf{W}_Q,   \\
\mathbf{K} &=& \mathbf{X} \mathbf{W}_K , \\
\mathbf{V} &=& \mathbf{X} \mathbf{W}_V.\\
\end{array}
$$
and $\mathbf{X}$ represents the complete input sequence (all the tokens).

$d_k$ is the dimension of the key vectors, which equals the dimension of the input when no projection is applied, and $\sqrt{d_k}$ is a scaling factor. The $\text{softmax}$ function is defined as:
$$
\text{softmax}(x_1, x_2, ..., x_j, ..., x_n) = (\frac{e^{x_1}}{\sum_{i=1}^n e^{x_i}}, \frac{e^{x_2}}{\sum_{i=1}^n e^{x_i}}, ..., \frac{e^{x_j}}{\sum_{i=1}^n e^{x_i}}, ..., \frac{e^{x_n}}{\sum_{i=1}^n e^{x_i}})
$$
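
The definition translates directly to NumPy. The sketch below is an illustrative alternative to `scipy.special.softmax`. The name `softmax_np` is an assumption, and the subtraction of the maximum is a common numerical precaution that does not change the result.

```python
import numpy as np

def softmax_np(x, axis=-1):
    """Softmax along `axis`, shifted by the maximum for numerical stability."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# Toy input: the largest value receives the largest share
probs = softmax_np(np.array([1.0, 0.5, 0.5]))
```

The outputs are positive and sum to one, so they can be read as weights.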

We omit the weight matrices and we use the same embeddings for $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$: the GloVe embeddings.
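
To see what omitting the weight matrices means, here is a sketch of the full formula in NumPy. With identity matrices for $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$, it reduces to $\text{softmax}(\frac{\mathbf{X} \mathbf{X}^\intercal}{\sqrt{d_k}}) \mathbf{X}$. The function name `attention`, the toy matrix `X_toy`, and the random projections are assumptions made for the example.

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention with explicit weight matrices."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V, weights

# Toy sequence: 3 tokens, dimension 4
X_toy = np.array([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.5, 1.0, 1.0],
                  [0.0, 1.0, 1.0, 1.0]])

# Identity projections: Q = K = V = X
eye = np.identity(X_toy.shape[1])
out_toy, weights_toy = attention(X_toy, eye, eye, eye)

# Random projections to d_k = 2: the output has shape (3, 2)
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(size=(4, 2)) for _ in range(3))
out_proj, weights_proj = attention(X_toy, W_Q, W_K, W_V)
```

In both calls, each row of the weight matrix sums to one. Only the projections change what the tokens are compared on.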

For the embedding matrix above, we compute the self-attention scores, $\text{softmax}(\frac{\mathbf{Q}  \mathbf{K}^\intercal}{\sqrt{d_k}})$:

In [13]:
attn_scores = softmax((embeddings_seq @ embeddings_seq.T)/
                      np.sqrt(embeddings_seq.shape[1]), axis=-1)

The scaled and normalized attention scores

In [14]:
print('\t', end='')
for i in range(len(words)):
    print(words[i], end='\t')
print()
for i in range(attn_scores.shape[0]):
    print(words[i], end='\t')
    for j in range(attn_scores.shape[1]):
        print(f"{attn_scores[i,j]:.2f}", end='\t')
    print()

	i	must	go	back	to	my	ship	and	to	my	crew	
i	0.36	0.05	0.07	0.05	0.04	0.19	0.01	0.02	0.04	0.19	0.01	
must	0.14	0.20	0.10	0.06	0.11	0.10	0.03	0.05	0.11	0.10	0.02	
go	0.18	0.09	0.14	0.09	0.08	0.13	0.02	0.04	0.08	0.13	0.02	
back	0.14	0.05	0.09	0.19	0.08	0.12	0.03	0.06	0.08	0.12	0.03	
to	0.11	0.11	0.09	0.09	0.15	0.08	0.04	0.07	0.15	0.08	0.03	
my	0.19	0.03	0.05	0.04	0.03	0.29	0.01	0.02	0.03	0.29	0.01	
ship	0.03	0.03	0.03	0.04	0.05	0.03	0.55	0.03	0.05	0.03	0.13	
and	0.10	0.08	0.07	0.10	0.12	0.09	0.04	0.15	0.12	0.09	0.04	
to	0.11	0.11	0.09	0.09	0.15	0.08	0.04	0.07	0.15	0.08	0.03	
my	0.19	0.03	0.05	0.04	0.03	0.29	0.01	0.02	0.03	0.29	0.01	
crew	0.06	0.05	0.05	0.06	0.05	0.06	0.21	0.04	0.05	0.06	0.31	


For _ship:_

In [15]:
attn_scores[6]

array([0.03030739, 0.03024587, 0.02764812, 0.0406623 , 0.04593486,
       0.03426556, 0.55297636, 0.02968701, 0.04593486, 0.03426556,
       0.12807212])

We have weights of 55% for _ship_ and 13% for _crew_; the remaining 32% is spread over the other words.

The new contextual embedding for _ship_ is a linear combination of the embeddings, here with the weights rounded to two decimals:

In [16]:
self_attention_ship = (0.03 * embeddings_dict['i'] + 
                  0.03 * embeddings_dict['must'] + 
                  0.03 * embeddings_dict['go'] +
                  0.04 * embeddings_dict['back'] +
                  0.05 * embeddings_dict['to'] + 
                  0.03 * embeddings_dict['my'] +
                  0.55 * embeddings_dict['ship'] +
                  0.03 * embeddings_dict['and'] +
                  0.05 * embeddings_dict['to'] +
                  0.03 * embeddings_dict['my'] +
                  0.13 * embeddings_dict['crew'])
self_attention_ship

array([ 1.044195  ,  0.09659944,  0.34672633, -0.42381316,  0.22031876,
       -0.09556399, -0.9915037 ,  0.6637363 ,  0.436829  , -0.794322  ,
        0.5639492 ,  0.98379046,  0.02403222,  0.5065729 ,  0.07323891,
       -0.17404956, -0.3321709 ,  0.561386  , -1.1613255 , -0.5717251 ,
        0.43559432,  0.41197652, -0.06589289, -0.33361682,  0.6578553 ,
       -1.7420686 , -0.03438139,  0.14395224,  0.8546864 , -0.14299722,
        2.6613998 , -0.05529932, -0.537614  ,  0.3057363 ,  0.40678102,
        0.22314468,  0.39586747, -0.29400417, -0.11625631, -0.13404053,
        0.17093891, -0.533202  ,  0.9551976 , -0.41781372, -0.10581654,
       -0.17152235, -0.22509809, -0.39232036,  0.20977767, -0.3625378 ],
      dtype=float32)

The exact computation for the whole matrix with NumPy, $\text{softmax}(\frac{\mathbf{Q}  \mathbf{K}^\intercal}{\sqrt{d_k}})  \mathbf{V}$:

In [17]:
self_attention = attn_scores @ embeddings_seq

For _ship:_ 

In [18]:
self_attention[6]

array([ 1.0387307 ,  0.10328993,  0.34260821, -0.43199392,  0.2236531 ,
       -0.09583739, -0.99261578,  0.66616271,  0.4423922 , -0.7941921 ,
        0.56381569,  0.99210325,  0.02053569,  0.50824176,  0.07430461,
       -0.17727756, -0.34077056,  0.56745014, -1.15453678, -0.57175906,
        0.42881261,  0.41905235, -0.06575875, -0.33385476,  0.66821521,
       -1.74733617, -0.04854005,  0.15311078,  0.86423044, -0.14474712,
        2.65712497, -0.05447188, -0.53432358,  0.31597919,  0.40407802,
        0.22768488,  0.39576729, -0.29159369, -0.11262517, -0.13846768,
        0.1743577 , -0.53750545,  0.94985317, -0.41448157, -0.10386168,
       -0.17551405, -0.22132868, -0.39945137,  0.21188122, -0.36097601])

## Chollet's implementation

Now we follow Chollet's book (2021), page 339, to outline the computation. The function below is drawn from the book and is slightly modified.

In [19]:
def self_attention(input_sequence):
    # The output will consist of contextual embeddings of the same shape
    output = np.zeros(shape=input_sequence.shape)
    attn_scores = []
    for i, pivot_vector in enumerate(input_sequence):
        scores = np.zeros(shape=(len(input_sequence),)) 
        for j, vector in enumerate(input_sequence):
            scores[j] = np.dot(pivot_vector, vector.T) # Q K^T
        scores /= np.sqrt(input_sequence.shape[1]) # sqrt(d_k)
        scores = softmax(scores) # softmax(Q K^T / sqrt(d_k))
        attn_scores += [scores]
        new_pivot_representation = np.zeros(shape=pivot_vector.shape) 
        for j, vector in enumerate(input_sequence):
             new_pivot_representation += vector * scores[j]
        output[i] = new_pivot_representation
    return output, np.array(attn_scores)

As input sequence, we use the GloVe embeddings of the words again:

In [20]:
words

['i', 'must', 'go', 'back', 'to', 'my', 'ship', 'and', 'to', 'my', 'crew']

In [21]:
embeddings_dict['ship']

array([ 1.5213  ,  0.10522 ,  0.38162 , -0.50801 ,  0.032423, -0.13484 ,
       -1.2474  ,  0.79813 ,  0.84691 , -1.101   ,  0.88743 ,  1.3749  ,
        0.42928 ,  0.65717 , -0.2636  , -0.41759 , -0.48846 ,  0.91061 ,
       -1.7158  , -0.438   ,  0.78395 ,  0.19636 , -0.40657 , -0.53971 ,
        0.82442 , -1.7434  ,  0.14285 ,  0.28037 ,  1.1688  ,  0.16897 ,
        2.2271  , -0.58273 , -0.45723 ,  0.62814 ,  0.54441 ,  0.28462 ,
        0.44485 , -0.55343 , -0.36493 , -0.016425,  0.40876 , -0.87148 ,
        1.5513  , -0.80704 , -0.10036 , -0.28461 , -0.33216 , -0.50609 ,
        0.48272 , -0.66198 ], dtype=float32)

In [22]:
embeddings_seq.shape

(11, 50)

We compute the new embeddings and the attention scores. The result is a pair.

In [23]:
attn_output = self_attention(embeddings_seq)

In [24]:
(len(attn_output), 
 attn_output[0].shape, 
 attn_output[1].shape)

(2, (11, 50), (11, 11))

Attention scores for _ship_

In [25]:
attn_output[1][6]

array([0.0303074 , 0.03024588, 0.02764812, 0.04066233, 0.04593488,
       0.03426557, 0.55297623, 0.02968703, 0.04593488, 0.03426557,
       0.12807209])

The new contextual embeddings for _ship:_

In [26]:
attn_output[0][6]

array([ 1.03873055,  0.10328993,  0.34260817, -0.43199395,  0.22365316,
       -0.09583739, -0.99261573,  0.66616262,  0.44239205, -0.79419196,
        0.56381559,  0.99210317,  0.02053557,  0.50824163,  0.07430472,
       -0.17727741, -0.3407705 ,  0.56744999, -1.1545366 , -0.57175908,
        0.42881253,  0.41905238, -0.06575866, -0.33385471,  0.66821517,
       -1.74733627, -0.0485401 ,  0.15311076,  0.86423036, -0.14474725,
        2.65712533, -0.05447172, -0.53432362,  0.31597912,  0.40407797,
        0.22768482,  0.39576725, -0.29159359, -0.11262507, -0.13846773,
        0.17435764, -0.53750533,  0.94985298, -0.41448147, -0.10386168,
       -0.17551402, -0.22132863, -0.39945136,  0.21188116, -0.36097596])

## Keras implementation
 
Keras has an implementation of self-attention encapsulated in the `MultiHeadAttention` class. Before going to the attention module, the query, key, and value each go through a dense layer. The output also goes through a dense layer (missing from Chollet's book, page 342). These dense layers are initialized with Glorot's algorithm.
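
The role of the dense layers can be outlined in NumPy. The sketch below is not the Keras code. It assumes a layer without bias whose query, key, and value kernels have the shape `(dim, num_heads, key_dim)` and whose output kernel has the shape `(num_heads, key_dim, dim)`. The function names and the random toy values are hypothetical.

```python
import numpy as np

def softmax_last(x):
    """Softmax over the last axis, shifted by the maximum for stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_dense(X, W_q, W_k, W_v, W_o):
    """
    Bias-free attention with dense projections (assumed shapes).
    X: (batch, seq, dim); W_q, W_k, W_v: (dim, heads, key_dim);
    W_o: (heads, key_dim, dim).
    """
    Q = np.einsum('bsd,dhk->bshk', X, W_q)   # dense layer on the query
    K = np.einsum('bsd,dhk->bshk', X, W_k)   # dense layer on the key
    V = np.einsum('bsd,dhk->bshk', X, W_v)   # dense layer on the value
    key_dim = W_k.shape[-1]
    scores = softmax_last(np.einsum('bqhk,bshk->bhqs', Q, K) / np.sqrt(key_dim))
    context = np.einsum('bhqs,bshk->bqhk', scores, V)
    # Final dense layer on the output
    return np.einsum('bqhk,hkd->bqd', context, W_o), scores

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1, 3, 4))           # one toy sequence: 3 tokens, dim 4
W_q, W_k, W_v = (rng.normal(size=(4, 1, 4)) for _ in range(3))
W_o = rng.normal(size=(1, 4, 4))
out_toy, scores_toy = attention_with_dense(X_toy, W_q, W_k, W_v, W_o)
```

With one head, the output has the shape `(1, 3, 4)` and the scores the shape `(1, 1, 3, 3)`: batch, head, query position, key position.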

In [27]:
from tensorflow.keras.layers import MultiHeadAttention

att_layer = MultiHeadAttention(num_heads=1, 
                               key_dim=50, 
                               #kernel_initializer=tf.keras.initializers.ones(),
                               use_bias=False, 
                               attention_axes=(1,))

In [28]:
np.array([embeddings_seq]).shape

(1, 11, 50)

In [29]:
(attn_output, attn_scores) = att_layer(np.array([embeddings_seq]), 
          np.array([embeddings_seq]),
          np.array([embeddings_seq]),
         return_attention_scores=True)



The attention score for _ship:_

In [30]:
attn_scores[0][0][6]

<tf.Tensor: shape=(11,), dtype=float32, numpy=
array([0.09334587, 0.09094715, 0.09149163, 0.09054147, 0.09002435,
       0.09244619, 0.08981434, 0.08916097, 0.09002435, 0.09244619,
       0.08975761], dtype=float32)>

### The initial dense layers

The initial values of the weights, stored in four matrices:

In [31]:
w_init = att_layer.weights
len(w_init)

4

In [32]:
w_init

[<tf.Variable 'multi_head_attention/query/kernel:0' shape=(50, 1, 50) dtype=float32, numpy=
 array([[[ 0.0235358 , -0.04158013,  0.02329776, ...,  0.03353316,
          -0.04693759, -0.01708426]],
 
        [[ 0.04184161,  0.02855625, -0.04137975, ...,  0.0118022 ,
           0.01659506, -0.03278984]],
 
        [[ 0.02323494,  0.02261272,  0.01935604, ..., -0.04029571,
           0.04366339,  0.01383097]],
 
        ...,
 
        [[ 0.03462507,  0.00065291, -0.01841628, ..., -0.00940876,
          -0.04844818, -0.02721525]],
 
        [[ 0.01191358,  0.00262818,  0.03573297, ...,  0.02754027,
           0.00291654, -0.03835024]],
 
        [[ 0.01706018, -0.00571505, -0.03109601, ...,  0.04206156,
          -0.01439562, -0.01343737]]], dtype=float32)>,
 <tf.Variable 'multi_head_attention/key/kernel:0' shape=(50, 1, 50) dtype=float32, numpy=
 array([[[ 1.3113622e-02, -3.7193641e-02, -3.1759270e-02, ...,
           1.3093915e-02, -4.5989122e-02, -2.1870652e-02]],
 
        [[ 4.4183061

### By-passing the dense layers

We create identity matrices to pass through the dense layers and recover the attention values and scores

In [33]:
i_50 = np.identity(50)

In [34]:
w_pt_50 = [i_50.reshape(50, 1, 50) for _ in range(3)] + [i_50.reshape(1, 50, 50)]

We set the new weights (pass through)

In [35]:
att_layer.set_weights(w_pt_50)
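
Why reshaped identity matrices act as a pass-through can be checked in NumPy. The sketch assumes that each dense layer is a contraction of the input with its kernel, written here with `np.einsum`. The toy input and the names `W_in` and `W_out` are made up for the example.

```python
import numpy as np

dim = 4
X_toy = np.arange(12, dtype=float).reshape(1, 3, dim)   # (batch, seq, dim)

# Identity matrices reshaped to the kernel shapes
W_in = np.identity(dim).reshape(dim, 1, dim)    # query, key, or value kernel
W_out = np.identity(dim).reshape(1, dim, dim)   # output kernel

# Input projection: adds a head axis, leaves the values unchanged
projected = np.einsum('bsd,dhk->bshk', X_toy, W_in)
# Output projection: removes the head axis, leaves the values unchanged
restored = np.einsum('bshk,hkd->bsd', projected, W_out)
```

Both projections return the input values, so only the softmax-weighted sum remains in the layer.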

### Multihead attention without the dense layers

We now obtain the same results as the `self_attention()` function for _ship:_

The attention scores for _ship:_

In [36]:
att_layer(np.array([embeddings_seq]), 
          np.array([embeddings_seq]),
          np.array([embeddings_seq]),
         return_attention_scores=True)[1][0][0][6]

<tf.Tensor: shape=(11,), dtype=float32, numpy=
array([0.03030742, 0.03024589, 0.02764813, 0.04066233, 0.04593488,
       0.03426558, 0.55297625, 0.02968703, 0.04593488, 0.03426558,
       0.12807208], dtype=float32)>

The attention output, that is the contextual embedding, for _ship:_

In [37]:
att_layer(np.array([embeddings_seq]), 
          np.array([embeddings_seq]),
          np.array([embeddings_seq]),
         return_attention_scores=True)[0][0][6]

<tf.Tensor: shape=(50,), dtype=float32, numpy=
array([ 1.0387306 ,  0.10328995,  0.34260815, -0.43199396,  0.22365318,
       -0.09583741, -0.99261564,  0.6661626 ,  0.44239208, -0.79419196,
        0.56381553,  0.9921032 ,  0.02053553,  0.5082416 ,  0.07430474,
       -0.17727737, -0.34077048,  0.56745   , -1.1545366 , -0.5717591 ,
        0.42881253,  0.41905236, -0.06575862, -0.3338547 ,  0.6682152 ,
       -1.7473364 , -0.04854015,  0.15311077,  0.8642304 , -0.1447473 ,
        2.6571252 , -0.05447169, -0.5343237 ,  0.31597912,  0.40407795,
        0.22768481,  0.39576727, -0.29159355, -0.11262506, -0.13846776,
        0.17435764, -0.5375053 ,  0.9498529 , -0.41448146, -0.10386168,
       -0.17551401, -0.22132863, -0.3994514 ,  0.21188116, -0.36097592],
      dtype=float32)>

## Test with a simple matrix

Three words, dimension of embeddings: 4

In [38]:
test_input_sequence = np.array([[[1.0, 0.0, 0.0, 1.0],
                                 [0.0, 1.5, 1.0, 1.0],
                                 [0.0, 1.0, 1.0, 1.0]]])

In [39]:
test_input_sequence.shape

(1, 3, 4)

### Self-attention from the book

In [40]:
self_attention(test_input_sequence[0])

(array([[0.45186276, 0.68517155, 0.54813724, 1.        ],
        [0.10450673, 1.16085775, 0.89549327, 1.        ],
        [0.13872271, 1.10337221, 0.86127729, 1.        ]]),
 array([[0.45186276, 0.27406862, 0.27406862],
        [0.10450673, 0.53072895, 0.36476432],
        [0.13872271, 0.48418985, 0.37708743]]))

### Multihead attention from Keras

In [41]:
att_layer = MultiHeadAttention(num_heads=1, 
                               key_dim=4, 
                               #kernel_initializer=tf.keras.initializers.ones(), #my_init,
                               use_bias=False, attention_axes=(1,))

The multihead attention uses a Glorot initialization of the dense layers. The results will be different from those of `self_attention()`.

In [42]:
att_layer(test_input_sequence, 
          test_input_sequence,
          test_input_sequence,
         return_attention_scores=True)

(<tf.Tensor: shape=(1, 3, 4), dtype=float32, numpy=
 array([[[-0.3766441 ,  0.33481684, -0.01428649,  0.14875287],
         [-0.37168398,  0.3519954 ,  0.00520262,  0.16505039],
         [-0.37213954,  0.34879547,  0.00183475,  0.16199642]]],
       dtype=float32)>,
 <tf.Tensor: shape=(1, 1, 3, 3), dtype=float32, numpy=
 array([[[[0.3819979 , 0.30329558, 0.31470647],
          [0.28096676, 0.35826197, 0.3607713 ],
          [0.29804945, 0.34727123, 0.3546793 ]]]], dtype=float32)>)

Weights of the dense layers

In [43]:
att_layer.weights #att_layer.get_weights()

[<tf.Variable 'multi_head_attention_1/query/kernel:0' shape=(4, 1, 4) dtype=float32, numpy=
 array([[[ 0.26624137, -0.12857488,  0.21427745,  0.12562203]],
 
        [[ 0.07248586, -0.01270992, -0.3568948 , -0.03817797]],
 
        [[-0.5223397 ,  0.29709953,  0.06554359,  0.5066794 ]],
 
        [[-0.33860448,  0.0512777 ,  0.01473051, -0.35103783]]],
       dtype=float32)>,
 <tf.Variable 'multi_head_attention_1/key/kernel:0' shape=(4, 1, 4) dtype=float32, numpy=
 array([[[ 0.40734494, -0.26249927,  0.1891346 , -0.33850628]],
 
        [[ 0.3839069 ,  0.38551158, -0.26628643,  0.12939876]],
 
        [[ 0.29185414,  0.13014114, -0.4210621 ,  0.00790733]],
 
        [[ 0.26130402, -0.13597435,  0.5441804 ,  0.21766269]]],
       dtype=float32)>,
 <tf.Variable 'multi_head_attention_1/value/kernel:0' shape=(4, 1, 4) dtype=float32, numpy=
 array([[[-0.4021383 , -0.25917742, -0.2872401 , -0.16896671]],
 
        [[-0.1219289 ,  0.46332443,  0.423333  , -0.54266614]],
 
        [[ 0.2242751

### By-passing the dense layers

We use identity matrices as weights

In [44]:
i_4 = np.identity(4)
w_pt_4 = [i_4.reshape(4, 1, 4) for _ in range(3)] + [i_4.reshape(1, 4, 4)]

In [45]:
w_pt_4

[array([[[1., 0., 0., 0.]],
 
        [[0., 1., 0., 0.]],
 
        [[0., 0., 1., 0.]],
 
        [[0., 0., 0., 1.]]]),
 array([[[1., 0., 0., 0.]],
 
        [[0., 1., 0., 0.]],
 
        [[0., 0., 1., 0.]],
 
        [[0., 0., 0., 1.]]]),
 array([[[1., 0., 0., 0.]],
 
        [[0., 1., 0., 0.]],
 
        [[0., 0., 1., 0.]],
 
        [[0., 0., 0., 1.]]]),
 array([[[1., 0., 0., 0.],
         [0., 1., 0., 0.],
         [0., 0., 1., 0.],
         [0., 0., 0., 1.]]])]

We set these weights

In [46]:
att_layer.set_weights(w_pt_4)

Now we have the same results as with `self_attention()`

In [47]:
att_layer(test_input_sequence, 
          test_input_sequence,
          test_input_sequence,
         return_attention_scores=True)

(<tf.Tensor: shape=(1, 3, 4), dtype=float32, numpy=
 array([[[0.45186275, 0.6851716 , 0.54813725, 1.        ],
         [0.10450672, 1.1608577 , 0.89549327, 1.        ],
         [0.13872272, 1.1033722 , 0.86127734, 1.        ]]], dtype=float32)>,
 <tf.Tensor: shape=(1, 1, 3, 3), dtype=float32, numpy=
 array([[[[0.45186275, 0.27406862, 0.27406862],
          [0.10450672, 0.53072894, 0.3647643 ],
          [0.13872272, 0.48418987, 0.37708744]]]], dtype=float32)>)