In [1]:
!pip install BPEmb

import math
import numpy as np
import tensorflow as tf

from bpemb import BPEmb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting BPEmb
  Downloading bpemb-0.3.4-py3-none-any.whl (19 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece, BPEmb
Successfully installed BPEmb-0.3.4 sentencepiece-0.1.97


We'll build a transformer from scratch, layer-by-layer. We'll start with the **Multi-Head Self-Attention** layer since that's the most involved bit. Once we have that working, the rest of the model will look familiar if you've been following the course so far.

# Multi-Head Self-Attention

##  Scaled Dot Product Self-Attention


Inside each attention head is a **Scaled Dot Product Self-Attention** operation as we covered in the slides. Given *queries*, *keys*, and *values*, the operation returns a new "mix" of the values.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The following function implements this operation. It also accepts an optional mask, which is used to ignore padding tokens and, when decoding, to hide future tokens (a **look-ahead mask**).

In [2]:
def scaled_dot_product_attention(query, key, value, mask=None):
  key_dim = tf.cast(tf.shape(key)[-1], tf.float32)  # dk
  scaled_scores = tf.matmul(query, key, transpose_b=True) / np.sqrt(key_dim)

  if mask is not None:
    scaled_scores = tf.where(mask==0, -np.inf, scaled_scores)

  softmax = tf.keras.layers.Softmax()
  weights = softmax(scaled_scores)
  return tf.matmul(weights, value), weights
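
To make the masking step concrete, here is a minimal NumPy sketch of the same operation applied with a look-ahead mask. The helper `attention_np` and the sizes used are illustrative assumptions, not part of the notebook's code; the sketch only mirrors the logic of the function above.

```python
import numpy as np

def attention_np(query, key, value, mask=None):
    """NumPy restatement of scaled dot-product attention (illustrative only)."""
    d_k = key.shape[-1]
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get -inf, so softmax gives them zero weight.
        scores = np.where(mask == 0, -np.inf, scores)
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights

seq_len, dim = 4, 8  # hypothetical sizes
rng = np.random.default_rng(0)
q = rng.random((seq_len, dim))
k = rng.random((seq_len, dim))
v = rng.random((seq_len, dim))

# Look-ahead mask: position i may attend to positions 0..i only.
look_ahead_mask = np.tril(np.ones((seq_len, seq_len)))

out, w = attention_np(q, k, v, look_ahead_mask)
print(np.round(w, 2))
```

Every row of the printed weights sums to 1, and everything above the diagonal is 0, so the first position can only attend to itself.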

In [3]:
# Suppose our *queries*, *keys*, and *values* are each a length of 3 with a dimension of 4.
seq_len = 3
embed_dim = 4

queries = np.random.rand(seq_len, embed_dim)
keys    = np.random.rand(seq_len, embed_dim)
values  = np.random.rand(seq_len, embed_dim)

print("Queries:\n", queries)

Queries:
 [[0.43176425 0.15350907 0.78373318 0.66364481]
 [0.13514988 0.79738097 0.0792238  0.48807841]
 [0.69380604 0.0067362  0.92388612 0.56572229]]


In [5]:
# self-attention output and weights
output, attn_weights = scaled_dot_product_attention(queries, keys, values)

print("Output\n", output, "\n")
print("Weights\n", attn_weights)

Output
 tf.Tensor(
[[0.48552772 0.32565042 0.4158109  0.39738902]
 [0.52122736 0.27932334 0.40136355 0.38737875]
 [0.4752882  0.34118706 0.41985312 0.3984257 ]], shape=(3, 4), dtype=float32) 

Weights
 tf.Tensor(
[[0.34662864 0.3022894  0.3510819 ]
 [0.2747659  0.37801412 0.34721997]
 [0.37168592 0.2791511  0.34916294]], shape=(3, 3), dtype=float32)
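
The $\sqrt{d_k}$ divisor in the attention formula keeps the scores at a similar scale regardless of the key dimension. The sketch below illustrates this under the assumption of zero-mean, unit-variance queries and keys (the random inputs used elsewhere in this notebook are uniform on [0, 1], so this is a simplification): the spread of the raw dot products grows with $d_k$, while the scaled scores stay close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
num_pairs = 10_000

for d_k in (4, 64, 512):
    # Zero-mean, unit-variance queries and keys (an assumption made for this demo).
    q = rng.standard_normal((num_pairs, d_k))
    k = rng.standard_normal((num_pairs, d_k))
    raw = np.sum(q * k, axis=-1)      # unscaled dot products
    scaled = raw / np.sqrt(d_k)       # scaled as in the attention formula
    print(f"d_k={d_k:4d}  std(raw)={raw.std():6.2f}  std(scaled)={scaled.std():.2f}")
```

Without the scaling, large scores would push the softmax towards a near one-hot distribution, which tends to produce very small gradients.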


## Generating queries, keys, and values for multiple heads.

Now that we have a way to calculate self-attention, let's actually generate the input *queries*, *keys*, and *values* for multiple heads.

In the slides (and in most references), each attention head has its <u>own separate</u> set of *query*, *key*, and *value* weights. Each weight matrix has dimension $d \times d/h$, where $d$ is the embedding dimension and $h$ is the number of heads.

![](https://drive.google.com/uc?export=view&id=1SLWkHQgy4nQPFvvjG5_V8UTtpSAJ2zrr)

It's easier to understand things this way and we can certainly code it this way as well. But we can also "simulate" different heads with a single query matrix, single key matrix, and single value matrix.
<br><br>
We'll do both. First we'll create *query*, *key*, and *value* vectors using separate weights per head.


In [5]:
# An example of 12-dimensional embeddings processed by three attention heads
batch_size = 1
seq_len = 3
embed_dim = 12
num_heads = 3
head_dim = embed_dim // num_heads

print(f"Dimension of each head: {head_dim}")

Dimension of each head: 4


### Using separate weight matrices per head

Suppose these are our input embeddings. Here we have a batch of 1 containing a sequence of length 3, with each element being a 12-dimensional embedding.

In [6]:
x = np.random.rand(batch_size, seq_len, embed_dim).round(1)
print("Input shape: ", x.shape, "\n")
print("Input:\n", x)

Input shape:  (1, 3, 12) 

Input:
 [[[0.9 0.1 0.9 0.1 0.3 0.9 0.2 0.7 0.8 0.  0.2 0.5]
  [0.1 0.  0.8 0.1 0.  0.8 0.9 0.5 0.  0.6 0.1 0.7]
  [0.1 0.6 0.1 0.5 0.6 0.2 0.3 0.4 0.6 0.7 0.8 1. ]]]


We'll declare three sets of *query* weights (one for each head), three sets of *key* weights, and three sets of *value* weights. Remember each weight matrix should have a dimension of $d \times d/h$.

In [9]:
# The query weights for each head
wq0 = np.random.rand(embed_dim, head_dim).round(1)
wq1 = np.random.rand(embed_dim, head_dim).round(1)
wq2 = np.random.rand(embed_dim, head_dim).round(1)

# The key weights for each head
wk0 = np.random.rand(embed_dim, head_dim).round(1)
wk1 = np.random.rand(embed_dim, head_dim).round(1)
wk2 = np.random.rand(embed_dim, head_dim).round(1)

# The value weights for each head
wv0 = np.random.rand(embed_dim, head_dim).round(1)
wv1 = np.random.rand(embed_dim, head_dim).round(1)
wv2 = np.random.rand(embed_dim, head_dim).round(1)

print("The three sets of query weights (one for each head):")
print("wq0:\n", wq0)
print("wq1:\n", wq1)
print("wq2:\n", wq2)

The three sets of query weights (one for each head):
wq0:
 [[0.1 0.7 0.7 0.5]
 [0.9 0.2 0.6 0.8]
 [0.5 0.8 0.7 0.9]
 [0.7 0.9 0.5 0.3]
 [0.9 0.9 0.7 0.2]
 [0.5 0.9 0.9 0.8]
 [0.7 0.7 0.4 0.9]
 [0.4 0.  0.3 0.8]
 [0.6 0.6 0.4 0.7]
 [0.1 0.6 0.4 0.2]
 [0.4 0.6 0.1 0.3]
 [0.8 0.9 0.5 0.6]]
wq1:
 [[0.9 0.7 0.9 0.3]
 [0.3 0.1 0.4 1. ]
 [0.7 0.3 0.5 0.6]
 [0.6 0.2 0.4 0.8]
 [0.  0.  0.2 0.6]
 [1.  0.5 0.9 0.2]
 [0.7 0.5 0.7 0.3]
 [0.7 0.3 1.  0.4]
 [0.4 0.6 0.9 1. ]
 [0.1 0.2 0.  0.1]
 [0.  0.2 0.6 0.6]
 [0.7 0.7 0.9 0.8]]
wq2:
 [[0.4 0.1 0.6 0.8]
 [0.7 0.9 0.5 0.6]
 [0.6 0.9 0.6 0.2]
 [0.5 0.8 1.  0.8]
 [0.8 0.6 0.7 0. ]
 [0.8 0.3 0.3 0.6]
 [0.6 0.  0.7 0.9]
 [0.5 0.9 0.7 0.6]
 [0.5 0.4 0.1 0.3]
 [0.5 0.9 0.6 0.8]
 [0.5 0.3 0.4 0.4]
 [0.6 0.7 0.8 0. ]]


We'll generate our *queries*, *keys*, and *values* for each head by multiplying our input by the weights.

In [11]:
# Generated queries, keys, and values for the first head
q0 = np.dot(x, wq0)
k0 = np.dot(x, wk0)
v0 = np.dot(x, wv0)

# Generated queries, keys, and values for the second head
q1 = np.dot(x, wq1)
k1 = np.dot(x, wk1)
v1 = np.dot(x, wv1)

# Generated queries, keys, and values for the third head
q2 = np.dot(x, wq2)
k2 = np.dot(x, wk2)
v2 = np.dot(x, wv2)

print("Q, K, and V for first head:\n")

print(f"q0 {q0.shape}:\n", q0, "\n")
print(f"k0 {k0.shape}:\n", k0, "\n")
print(f"v0 {v0.shape}:\n", v0)

Q, K, and V for first head:

q0 (1, 3, 4):
 [[[3.67 4.38 3.06 3.6 ]
  [3.41 4.08 3.14 3.71]
  [3.01 3.58 2.93 3.  ]]] 

k0 (1, 3, 4):
 [[[3.59 3.33 2.19 3.24]
  [3.82 3.57 2.27 3.32]
  [3.13 3.07 2.12 3.26]]] 

v0 (1, 3, 4):
 [[[2.54 4.   3.93 3.58]
  [2.92 3.83 3.23 3.8 ]
  [2.85 3.4  3.5  3.37]]]


Now that we have our Q, K, V vectors, we can just pass them to our self-attention operation. Here we're calculating the output and attention weights for the first head.

In [12]:
out0, attn_weights0 = scaled_dot_product_attention(q0, k0, v0)

print("Output from first attention head: ", out0, "\n")
print("Attention weights from first head: ", attn_weights0)

Output from first attention head:  tf.Tensor(
[[[2.833825  3.8457968 3.3957014 3.7308974]
  [2.8301964 3.8441498 3.4033847 3.7260363]
  [2.8210227 3.8409605 3.422514  3.7145672]]], shape=(1, 3, 4), dtype=float32) 

Attention weights from first head:  tf.Tensor(
[[[0.21769002 0.73298293 0.04932702]
  [0.22593231 0.7176527  0.05641498]
  [0.24716169 0.6806128  0.07222551]]], shape=(1, 3, 3), dtype=float32)


In [13]:
# Here are the other two (attention weights are ignored)
out1, _ = scaled_dot_product_attention(q1, k1, v1)
out2, _ = scaled_dot_product_attention(q2, k2, v2)

print("Output from second attention head: ", out1, "\n")
print("Output from third attention head: ", out2)

Output from second attention head:  tf.Tensor(
[[[2.4284098 2.32754   3.0384266 4.010263 ]
  [2.441395  2.312757  3.039845  4.000343 ]
  [2.4330301 2.3443832 3.0244172 4.004699 ]]], shape=(1, 3, 4), dtype=float32) 

Output from third attention head:  tf.Tensor(
[[[3.0552158 1.5740515 3.0670934 3.2499583]
  [3.0592449 1.5751076 3.068358  3.2518098]
  [3.036745  1.5701491 3.061039  3.2413504]]], shape=(1, 3, 4), dtype=float32)


In [14]:
# once we have each head's output, 
# we concatenate them and then put them through a linear layer for further processing
combined_out_a = np.concatenate((out0, out1, out2), axis=-1)
print(f"Combined output from all heads {combined_out_a.shape}:")
print(combined_out_a)

# The final step would be to run combined_out_a through a linear/dense layer 
# for further processing.

Combined output from all heads (1, 3, 12):
[[[2.833825  3.8457968 3.3957014 3.7308974 2.4284098 2.32754   3.0384266
   4.010263  3.0552158 1.5740515 3.0670934 3.2499583]
  [2.8301964 3.8441498 3.4033847 3.7260363 2.441395  2.312757  3.039845
   4.000343  3.0592449 1.5751076 3.068358  3.2518098]
  [2.8210227 3.8409605 3.422514  3.7145672 2.4330301 2.3443832 3.0244172
   4.004699  3.036745  1.5701491 3.061039  3.2413504]]]


So that's a complete run of **multi-head self-attention** using separate sets of weights per head.<br>

### Using a single query weight matrix, single key weight matrix, and single value weight matrix.

These were our separate per-head query weights:

In [16]:
print("Query weights for first head: \n", wq0, "\n")
print("Query weights for second head: \n", wq1, "\n")
print("Query weights for third head: \n", wq2)

Query weights for first head: 
 [[0.1 0.7 0.7 0.5]
 [0.9 0.2 0.6 0.8]
 [0.5 0.8 0.7 0.9]
 [0.7 0.9 0.5 0.3]
 [0.9 0.9 0.7 0.2]
 [0.5 0.9 0.9 0.8]
 [0.7 0.7 0.4 0.9]
 [0.4 0.  0.3 0.8]
 [0.6 0.6 0.4 0.7]
 [0.1 0.6 0.4 0.2]
 [0.4 0.6 0.1 0.3]
 [0.8 0.9 0.5 0.6]] 

Query weights for second head: 
 [[0.9 0.7 0.9 0.3]
 [0.3 0.1 0.4 1. ]
 [0.7 0.3 0.5 0.6]
 [0.6 0.2 0.4 0.8]
 [0.  0.  0.2 0.6]
 [1.  0.5 0.9 0.2]
 [0.7 0.5 0.7 0.3]
 [0.7 0.3 1.  0.4]
 [0.4 0.6 0.9 1. ]
 [0.1 0.2 0.  0.1]
 [0.  0.2 0.6 0.6]
 [0.7 0.7 0.9 0.8]] 

Query weights for third head: 
 [[0.4 0.1 0.6 0.8]
 [0.7 0.9 0.5 0.6]
 [0.6 0.9 0.6 0.2]
 [0.5 0.8 1.  0.8]
 [0.8 0.6 0.7 0. ]
 [0.8 0.3 0.3 0.6]
 [0.6 0.  0.7 0.9]
 [0.5 0.9 0.7 0.6]
 [0.5 0.4 0.1 0.3]
 [0.5 0.9 0.6 0.8]
 [0.5 0.3 0.4 0.4]
 [0.6 0.7 0.8 0. ]]


Suppose that instead of declaring three separate query weight matrices, we had declared just one: a single $d \times d$ matrix. Here we concatenate our per-head query weights, rather than declaring a new set of weights, so that we get the same results as before.

In [17]:
wq = np.concatenate((wq0, wq1, wq2), axis=1)
print(f"Single query weight matrix {wq.shape}: \n", wq)

Single query weight matrix (12, 12): 
 [[0.1 0.7 0.7 0.5 0.9 0.7 0.9 0.3 0.4 0.1 0.6 0.8]
 [0.9 0.2 0.6 0.8 0.3 0.1 0.4 1.  0.7 0.9 0.5 0.6]
 [0.5 0.8 0.7 0.9 0.7 0.3 0.5 0.6 0.6 0.9 0.6 0.2]
 [0.7 0.9 0.5 0.3 0.6 0.2 0.4 0.8 0.5 0.8 1.  0.8]
 [0.9 0.9 0.7 0.2 0.  0.  0.2 0.6 0.8 0.6 0.7 0. ]
 [0.5 0.9 0.9 0.8 1.  0.5 0.9 0.2 0.8 0.3 0.3 0.6]
 [0.7 0.7 0.4 0.9 0.7 0.5 0.7 0.3 0.6 0.  0.7 0.9]
 [0.4 0.  0.3 0.8 0.7 0.3 1.  0.4 0.5 0.9 0.7 0.6]
 [0.6 0.6 0.4 0.7 0.4 0.6 0.9 1.  0.5 0.4 0.1 0.3]
 [0.1 0.6 0.4 0.2 0.1 0.2 0.  0.1 0.5 0.9 0.6 0.8]
 [0.4 0.6 0.1 0.3 0.  0.2 0.6 0.6 0.5 0.3 0.4 0.4]
 [0.8 0.9 0.5 0.6 0.7 0.7 0.9 0.8 0.6 0.7 0.8 0. ]]
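
The reason concatenating the weights is safe can be checked directly: projecting with each head's matrix and concatenating the results gives the same numbers as concatenating the matrices first and projecting once. The sketch below uses small hypothetical sizes and names (`x_demo`, `per_head_weights`) that are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 6, 3                     # hypothetical model dimension and number of heads
x_demo = rng.random((2, d))     # two tokens

# One (d, d/h) weight matrix per head.
per_head_weights = [rng.random((d, d // h)) for _ in range(h)]

# Project with each head separately, then concatenate the results.
separate = np.concatenate([x_demo @ w for w in per_head_weights], axis=-1)

# Concatenate the weights first, then project once.
w_combined = np.concatenate(per_head_weights, axis=1)
combined = x_demo @ w_combined

print(np.allclose(separate, combined))  # True
```

This holds because each output column of a matrix product depends only on the matching column of the weight matrix.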


In [18]:
# In the same vein, pretend we declared a single key weight matrix, and single value weight matrix.
wk = np.concatenate((wk0, wk1, wk2), axis=1)
wv = np.concatenate((wv0, wv1, wv2), axis=1)

print(f"Single key weight matrix {wk.shape}:\n", wk, "\n")
print(f"Single value weight matrix {wv.shape}:\n", wv)

Single key weight matrix (12, 12):
 [[0.5 0.4 0.5 0.9 1.  0.6 0.2 0.5 0.9 0.8 0.8 0.4]
 [0.9 0.9 0.  0.7 0.9 0.7 0.4 0.4 0.7 0.9 0.2 0.7]
 [0.6 0.8 0.7 0.7 0.9 0.4 0.7 0.3 0.  0.9 0.2 0.5]
 [0.9 0.4 0.5 1.  0.2 0.6 0.8 0.4 0.9 0.5 0.8 0.8]
 [0.5 0.3 0.2 0.9 0.3 0.9 0.5 1.  0.3 0.5 0.1 0. ]
 [0.9 0.6 0.5 0.4 0.7 0.1 0.8 0.5 0.8 0.1 0.7 0.2]
 [0.4 0.8 0.1 1.  0.4 0.6 0.1 0.8 0.5 0.5 0.8 0.3]
 [0.1 0.7 0.9 0.5 0.7 0.  0.8 1.  0.3 0.8 0.2 0.2]
 [0.7 0.5 0.2 0.1 0.5 0.9 0.7 1.  0.5 0.3 0.  0.9]
 [0.3 0.5 0.4 0.7 0.8 0.6 0.  0.8 0.1 0.7 0.1 0.5]
 [0.6 0.6 0.8 0.1 0.9 0.1 0.6 0.9 0.  0.4 0.9 0.3]
 [0.9 0.4 0.2 0.  0.9 1.  0.7 0.  0.8 0.6 0.6 0.1]] 

Single value weight matrix (12, 12):
 [[0.9 0.2 0.3 0.9 0.7 0.3 0.5 0.6 0.4 1.  0.8 0.9]
 [0.9 0.7 0.4 0.2 0.3 0.5 0.  0.8 0.4 0.6 0.6 0.5]
 [0.6 0.5 0.8 0.7 0.1 0.7 1.  0.8 0.2 0.  0.  0.4]
 [0.2 1.  0.9 0.6 0.4 0.1 0.7 0.  0.3 0.1 0.7 0. ]
 [0.3 0.4 0.9 0.8 0.4 0.7 0.5 0.8 0.2 0.5 0.6 0. ]
 [0.2 0.  0.3 0.8 0.9 0.2 0.3 1.  0.6 0.1 0.5 0.2]
 [0.2

In [19]:
# calculate all our queries, keys, and values with three dot products
q_s = np.dot(x, wq)
k_s = np.dot(x, wk)
v_s = np.dot(x, wv)

print(f"Query vectors using a single weight matrix {q_s.shape}:\n", q_s)

Query vectors using a single weight matrix (1, 3, 12):
 [[[3.67 4.38 3.06 3.6  2.86 2.21 3.62 3.48 3.59 3.13 3.35 2.26]
  [3.41 4.08 3.14 3.71 3.4  2.36 3.8  3.   3.63 3.3  3.66 3.18]
  [3.01 3.58 2.93 3.   2.13 1.63 2.82 3.2  3.24 3.27 2.85 2.18]]]


Somehow, we need to separate these vectors so that the self-attention operation treats them as three separate sets.

In [20]:
print(q0, "\n")
print(q1, "\n")
print(q2)

[[[3.67 4.38 3.06 3.6 ]
  [3.41 4.08 3.14 3.71]
  [3.01 3.58 2.93 3.  ]]] 

[[[2.86 2.21 3.62 3.48]
  [3.4  2.36 3.8  3.  ]
  [2.13 1.63 2.82 3.2 ]]] 

[[[3.59 3.13 3.35 2.26]
  [3.63 3.3  3.66 3.18]
  [3.24 3.27 2.85 2.18]]]


Notice how each set of per-head queries looks like we took the combined queries and chopped them vertically every four dimensions.
<br><br>
We can split our combined queries into $h$ heads of dimension $d/h$ each using **reshape** and **transpose**.<br><br>
The first step is to *reshape* our combined queries from a shape of:<br>
(batch_size, seq_len, embed_dim)<br>

into a shape of<br>
 (batch_size, seq_len, num_heads, head_dim).
 <br>

In [21]:
# Note: we can achieve the same thing by passing -1 instead of seq_len
q_s_reshaped = tf.reshape(q_s, (batch_size, seq_len, num_heads, head_dim))

print(f"Combined queries: {q_s.shape}\n", q_s, "\n")
print(f"Reshaped into separate heads: {q_s_reshaped.shape}\n", q_s_reshaped)

Combined queries: (1, 3, 12)
 [[[3.67 4.38 3.06 3.6  2.86 2.21 3.62 3.48 3.59 3.13 3.35 2.26]
  [3.41 4.08 3.14 3.71 3.4  2.36 3.8  3.   3.63 3.3  3.66 3.18]
  [3.01 3.58 2.93 3.   2.13 1.63 2.82 3.2  3.24 3.27 2.85 2.18]]] 

Reshaped into separate heads: (1, 3, 3, 4)
 tf.Tensor(
[[[[3.67 4.38 3.06 3.6 ]
   [2.86 2.21 3.62 3.48]
   [3.59 3.13 3.35 2.26]]

  [[3.41 4.08 3.14 3.71]
   [3.4  2.36 3.8  3.  ]
   [3.63 3.3  3.66 3.18]]

  [[3.01 3.58 2.93 3.  ]
   [2.13 1.63 2.82 3.2 ]
   [3.24 3.27 2.85 2.18]]]], shape=(1, 3, 3, 4), dtype=float64)


At this point, we have split the embedding dimension into heads. The next step is to *transpose* the tensor so that it simulates vertically chopping our combined queries. After transposing, the dimensions become:<br>
(batch_size, num_heads, seq_len, head_dim)<br>

In [22]:
q_s_transposed = tf.transpose(q_s_reshaped, perm=[0, 2, 1, 3]).numpy()

print(f"Queries transposed into \"separate\" heads {q_s_transposed.shape}:\n", 
      q_s_transposed)

Queries transposed into "separate" heads (1, 3, 3, 4):
 [[[[3.67 4.38 3.06 3.6 ]
   [3.41 4.08 3.14 3.71]
   [3.01 3.58 2.93 3.  ]]

  [[2.86 2.21 3.62 3.48]
   [3.4  2.36 3.8  3.  ]
   [2.13 1.63 2.82 3.2 ]]

  [[3.59 3.13 3.35 2.26]
   [3.63 3.3  3.66 3.18]
   [3.24 3.27 2.85 2.18]]]]
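
It may be tempting to skip the transpose and reshape straight to (batch_size, num_heads, seq_len, head_dim). The shape would be right, but the values would be grouped incorrectly. The following NumPy sketch, using a small hypothetical tensor whose values encode their own position, shows the difference.

```python
import numpy as np

batch, seq, heads, hd = 1, 3, 2, 2   # hypothetical small sizes
# Token t holds [10t, 10t+1, 10t+2, 10t+3]; the first two values belong to head 0.
combined = np.array([[[0, 1, 2, 3],
                      [10, 11, 12, 13],
                      [20, 21, 22, 23]]])

correct = combined.reshape(batch, seq, heads, hd).transpose(0, 2, 1, 3)
naive = combined.reshape(batch, heads, seq, hd)   # same shape, wrong grouping

print("reshape + transpose, head 0:\n", correct[0, 0])
print("reshape only, head 0:\n", naive[0, 0])
```

With reshape followed by transpose, head 0 holds the first two dimensions of every token. With reshape alone, head 0 ends up with all four dimensions of the first token plus half of the second.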


If we compare this against the separate per-head queries we calculated previously, we see the same result except we now have all our queries in a single matrix.

In [23]:
print("The separate per-head query matrices from before: ")
print(q0, "\n")
print(q1, "\n")
print(q2)

The separate per-head query matrices from before: 
[[[3.67 4.38 3.06 3.6 ]
  [3.41 4.08 3.14 3.71]
  [3.01 3.58 2.93 3.  ]]] 

[[[2.86 2.21 3.62 3.48]
  [3.4  2.36 3.8  3.  ]
  [2.13 1.63 2.82 3.2 ]]] 

[[[3.59 3.13 3.35 2.26]
  [3.63 3.3  3.66 3.18]
  [3.24 3.27 2.85 2.18]]]


Let's do the exact same thing with our combined keys and values.

In [26]:
k_s_transposed = tf.transpose(tf.reshape(k_s, (batch_size, seq_len, num_heads, head_dim)), perm=[0,2,1,3]).numpy()
v_s_transposed = tf.transpose(tf.reshape(v_s, (batch_size, seq_len, num_heads, head_dim)), perm=[0,2,1,3]).numpy()

print(f"Keys for all heads in a single matrix {k_s_transposed.shape}: \n", k_s_transposed, "\n")
print(f"Values for all heads in a single matrix {v_s_transposed.shape}: \n", v_s_transposed)

Keys for all heads in a single matrix (1, 3, 3, 4): 
 [[[[3.59 3.33 2.19 3.24]
   [3.82 3.57 2.27 3.32]
   [3.13 3.07 2.12 3.26]]

  [[3.6  3.66 3.25 3.91]
   [4.2  3.19 3.01 3.34]
   [3.67 3.27 2.7  3.81]]

  [[2.41 3.12 2.36 2.23]
   [3.16 3.32 3.12 2.09]
   [1.96 3.29 1.6  2.33]]]] 

Values for all heads in a single matrix (1, 3, 3, 4): 
 [[[[2.54 4.   3.93 3.58]
   [2.92 3.83 3.23 3.8 ]
   [2.85 3.4  3.5  3.37]]

  [[2.35 2.31 3.1  4.08]
   [2.66 2.1  3.04 3.83]
   [2.43 2.75 2.76 3.97]]

  [[2.51 1.27 2.94 3.02]
   [3.1  1.59 3.08 3.27]
   [2.11 1.84 2.63 2.75]]]]


Set up this way, we can now calculate the outputs from all attention heads with a single call to our self-attention operation.

In [27]:
all_heads_output, all_attn_weights = scaled_dot_product_attention(q_s_transposed,
                                                                  k_s_transposed,
                                                                  v_s_transposed)
print("Self attention output:\n", all_heads_output)

Self attention output:
 tf.Tensor(
[[[[2.833825  3.8457968 3.3957014 3.7308974]
   [2.8301964 3.8441498 3.4033847 3.7260363]
   [2.8210227 3.8409605 3.422514  3.7145672]]

  [[2.4284096 2.3275397 3.0384262 4.0102625]
   [2.441395  2.312757  3.039845  4.000343 ]
   [2.4330301 2.3443832 3.0244172 4.004699 ]]

  [[3.0552158 1.5740515 3.0670934 3.2499583]
   [3.0592449 1.5751076 3.068358  3.2518098]
   [3.036745  1.5701491 3.061039  3.2413504]]]], shape=(1, 3, 3, 4), dtype=float32)


In [28]:
# As a sanity check, we can compare this against the outputs from individual heads we calculated earlier:
print("Per head outputs from using separate sets of weights per head:")
print(out0, "\n")
print(out1, "\n")
print(out2)

Per head outputs from using separate sets of weights per head:
tf.Tensor(
[[[2.833825  3.8457968 3.3957014 3.7308974]
  [2.8301964 3.8441498 3.4033847 3.7260363]
  [2.8210227 3.8409605 3.422514  3.7145672]]], shape=(1, 3, 4), dtype=float32) 

tf.Tensor(
[[[2.4284098 2.32754   3.0384266 4.010263 ]
  [2.441395  2.312757  3.039845  4.000343 ]
  [2.4330301 2.3443832 3.0244172 4.004699 ]]], shape=(1, 3, 4), dtype=float32) 

tf.Tensor(
[[[3.0552158 1.5740515 3.0670934 3.2499583]
  [3.0592449 1.5751076 3.068358  3.2518098]
  [3.036745  1.5701491 3.061039  3.2413504]]], shape=(1, 3, 4), dtype=float32)


To get the final concatenated result, we need to reverse our **reshape** and **transpose** operation, starting with the **transpose** this time.

In [29]:
combined_out_b = tf.reshape(tf.transpose(all_heads_output, perm=[0, 2, 1, 3]),
                            shape=(batch_size, seq_len, embed_dim))

print("Final output from using single query, key, value matrices:\n", 
      combined_out_b, "\n")
print("Final output from using separate query, key, value matrices per head:\n", 
      combined_out_a)

Final output from using single query, key, value matrices:
 tf.Tensor(
[[[2.833825  3.8457968 3.3957014 3.7308974 2.4284096 2.3275397 3.0384262
   4.0102625 3.0552158 1.5740515 3.0670934 3.2499583]
  [2.8301964 3.8441498 3.4033847 3.7260363 2.441395  2.312757  3.039845
   4.000343  3.0592449 1.5751076 3.068358  3.2518098]
  [2.8210227 3.8409605 3.422514  3.7145672 2.4330301 2.3443832 3.0244172
   4.004699  3.036745  1.5701491 3.061039  3.2413504]]], shape=(1, 3, 12), dtype=float32) 

Final output from using separate query, key, value matrices per head:
 [[[2.833825  3.8457968 3.3957014 3.7308974 2.4284098 2.32754   3.0384266
   4.010263  3.0552158 1.5740515 3.0670934 3.2499583]
  [2.8301964 3.8441498 3.4033847 3.7260363 2.441395  2.312757  3.039845
   4.000343  3.0592449 1.5751076 3.068358  3.2518098]
  [2.8210227 3.8409605 3.422514  3.7145672 2.4330301 2.3443832 3.0244172
   4.004699  3.036745  1.5701491 3.061039  3.2413504]]]
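
As a quick check that merging really is the inverse of splitting, here is a small NumPy round trip. The helper names `split_heads_np` and `merge_heads_np` are hypothetical and exist only for this sketch.

```python
import numpy as np

def split_heads_np(t, num_heads):
    # (batch, seq, d) -> (batch, heads, seq, d / heads)
    b, s, d = t.shape
    return t.reshape(b, s, num_heads, d // num_heads).transpose(0, 2, 1, 3)

def merge_heads_np(t):
    # (batch, heads, seq, head_dim) -> (batch, seq, heads * head_dim)
    b, h, s, hd = t.shape
    return t.transpose(0, 2, 1, 3).reshape(b, s, h * hd)

t = np.arange(2 * 3 * 12).reshape(2, 3, 12)
round_trip = merge_heads_np(split_heads_np(t, num_heads=3))
print(np.array_equal(t, round_trip))  # True
```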


### Putting everything together
We can encapsulate everything we just covered in a class.

In [3]:
class MultiHeadSelfAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super().__init__()
    self.d_model = d_model  # model dimension
    self.num_heads = num_heads

    self.d_head = self.d_model // self.num_heads # head_dim

    self.wq = tf.keras.layers.Dense(self.d_model)
    self.wk = tf.keras.layers.Dense(self.d_model)
    self.wv = tf.keras.layers.Dense(self.d_model)

    # Linear layer to generate the final output 
    self.dense = tf.keras.layers.Dense(self.d_model)

  def split_heads(self, x):
    # Use the dynamic shape so this also works when the static batch size is unknown (None).
    batch_size = tf.shape(x)[0]

    split_inputs = tf.reshape(x, (batch_size, -1, self.num_heads, self.d_head))
    return tf.transpose(split_inputs, perm=[0, 2, 1, 3])

  def merge_heads(self, x):
    batch_size = tf.shape(x)[0]

    merged_inputs = tf.transpose(x, perm=[0, 2, 1, 3])
    return tf.reshape(merged_inputs, (batch_size, -1, self.d_model))
  
  def call(self, q, k, v, mask):
    qs = self.wq(q)
    ks = self.wk(k)
    vs = self.wv(v)
    
    qs = self.split_heads(qs)
    ks = self.split_heads(ks)
    vs = self.split_heads(vs)

    output, attn_weights = scaled_dot_product_attention(qs, ks, vs, mask)
    output = self.merge_heads(output)

    return self.dense(output), attn_weights

In [7]:
# sanity check
mhsa = MultiHeadSelfAttention(12, 3)

output, attn_weights = mhsa(x, x, x, None)
print(f"MHSA output{output.shape}:")
print(output)

MHSA output(1, 3, 12):
tf.Tensor(
[[[-1.0467246  -0.38746828 -0.00595006 -0.3837807  -0.24663472
   -0.2171552   0.18601058  0.33198413 -0.30640054  0.04098597
    0.09575015 -0.3390737 ]
  [-1.0250382  -0.38153133 -0.01544506 -0.34862208 -0.2214207
   -0.19764036  0.17804733  0.38613892 -0.30171084  0.00365527
    0.07438526 -0.2792571 ]
  [-1.0018137  -0.3578776   0.03382985 -0.4144197  -0.28377312
   -0.24793994  0.20015937  0.36169147 -0.32287133  0.05772191
    0.15491143 -0.31276175]]], shape=(1, 3, 12), dtype=float32)
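
A rough parameter count for this layer helps show that the number of heads does not change its size. The arithmetic below assumes each `Dense` layer keeps its default bias term, so it holds a $d \times d$ kernel plus a bias vector of length $d$.

```python
d_model = 12

# A Dense layer mapping d_model -> d_model: (d_model x d_model) kernel plus a bias vector.
params_per_dense = d_model * d_model + d_model

# wq, wk, wv, and the final output projection.
num_dense_layers = 4
total_params = num_dense_layers * params_per_dense

print(params_per_dense, total_params)  # 156 624
```

Note that `num_heads` does not appear anywhere in the calculation: splitting into heads only changes how the projected vectors are grouped, not how many weights there are.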


# Encoder Block


We can now build our **Encoder Block**. In addition to the **Multi-Head Self Attention** layer, the **Encoder Block** also has **skip connections**, **layer normalization steps**, and a **two-layer feed-forward neural network**. The original **Attention Is All You Need** paper also included some **dropout** applied to the self-attention output which isn't shown in the illustration below (see references for a link to the paper).

<div>
<img src="https://drive.google.com/uc?export=view&id=1D8sLDyQMqqhCjHWOn-I7rZKHugWxFyLy" width="500"/>
</div>

Since a two-layer feed forward neural network is used in multiple places in the transformer, here's a function which creates and returns one.

In [9]:
tf.keras.models.Sequential == tf.keras.Sequential

True

In [8]:
def feed_forward_network(d_model, hidden_dim):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(hidden_dim, activation='relu'),
      tf.keras.layers.Dense(d_model)
  ])

This is our encoder block containing all the layers and steps from the preceding illustration (plus dropout).

In [9]:
class EncoderBlock(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
    super().__init__()
    
    self.mhsa = MultiHeadSelfAttention(d_model, num_heads)
    self.ffn  = feed_forward_network(d_model, hidden_dim)

    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    # Each sub-layer gets its own LayerNormalization so the learned scale and
    # offset are not shared between the attention and feed-forward steps.
    self.layernorm1 = tf.keras.layers.LayerNormalization()
    self.layernorm2 = tf.keras.layers.LayerNormalization()

  def call(self, x, training, mask):
    mhsa_output, attn_weights = self.mhsa(x, x, x, mask)
    mhsa_output = self.dropout(mhsa_output, training=training)
    mhsa_output = self.layernorm1(x + mhsa_output)
    
    ffn_output = self.ffn(mhsa_output)
    ffn_output = self.dropout(ffn_output, training=training)
    output     = self.layernorm2(ffn_output + mhsa_output)
    
    return output, attn_weights


Suppose we have an embedding dimension of 12, and we want 3 attention heads and a feed forward network with a hidden dimension of 48 (4x the embedding dimension). We would declare and use a single encoder block like so:

In [10]:
encoder_block = EncoderBlock(12, 3, 48)

block_output,  _ = encoder_block(x, True, None)
print(f"Output from single encoder block {block_output.shape}:")
print(block_output)

Output from single encoder block (1, 3, 12):
tf.Tensor(
[[[-0.3716895   0.9727424   0.59347695  0.9779479  -0.83920187
    0.04553704  1.6231296  -0.6519545  -0.5456049   0.65776443
   -0.21320732 -2.2489407 ]
  [-0.19393569  0.69055057  0.01404015  0.87591046 -0.6793242
    0.39865723  1.7146664  -0.03991906 -1.5244924   1.192019
   -0.7731841  -1.6749882 ]
  [-0.1703595   1.1842399  -0.554807    0.04148576 -0.15297136
   -0.34296605 -0.45370454 -0.7941657  -0.20880504  2.1104808
    1.1808958  -1.8393229 ]]], shape=(1, 3, 12), dtype=float32)
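
The "add and normalize" step in the block can be sketched in plain NumPy. This is a simplified version: it leaves out the learned scale and offset that a real layer normalization carries, and the `eps` value is an assumed small constant. The random arrays are stand-ins for the block input and a sub-layer output.

```python
import numpy as np

def layer_norm_np(t, eps=1e-3):
    # Normalize each token vector (last axis) to zero mean and roughly unit variance.
    mean = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)
    return (t - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x_demo = rng.random((1, 3, 12))
sublayer_out = rng.random((1, 3, 12))   # stand-in for the attention or FFN output

# Skip connection followed by layer normalization.
normed = layer_norm_np(x_demo + sublayer_out)
print(normed.mean(axis=-1))
print(normed.std(axis=-1))
```

Each token vector ends up with a mean of approximately 0 and a standard deviation close to 1.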


# Word and Positional Embeddings

Let's now deal with the actual input to the **initial** encoder block. The inputs are going to be *positional word embeddings*. That is, word embeddings with some positional information added to them.
<br>

Let's start with **subword** tokenization. For demonstration, we'll use a subword tokenizer called **BPEmb**. It uses **Byte-Pair Encoding** and supports over two hundred languages. 

https://bpemb.h-its.org/


In [11]:
# Load the English tokenizer.
bpemb_en = BPEmb(lang='en')

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model


100%|██████████| 400869/400869 [00:00<00:00, 657807.23B/s]


downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz


100%|██████████| 3784656/3784656 [00:01<00:00, 3455090.31B/s]


In [12]:
# The library comes with embeddings for a number of words
bpemb_vocab_size, bpemb_embed_size = bpemb_en.vectors.shape

print("Vocabulary size:", bpemb_vocab_size)
print("Embedding size:", bpemb_embed_size)

Vocabulary size: 10000
Embedding size: 100


In [13]:
# Embedding for the word "car"
bpemb_en.vectors[bpemb_en.words.index('car')]

array([-0.305548, -0.325598, -0.134716, -0.078735, -0.660545,  0.076211,
       -0.735487,  0.124533, -0.294402,  0.459688,  0.030137,  0.174041,
       -0.224223,  0.486189, -0.504649, -0.459699,  0.315747,  0.477885,
        0.091398,  0.427867,  0.016524, -0.076833, -0.899727,  0.493158,
       -0.022309, -0.422785, -0.154148,  0.204981,  0.379834,  0.070588,
        0.196073, -0.368222,  0.473406,  0.007409,  0.004303, -0.007823,
       -0.19103 , -0.202509,  0.109878, -0.224521, -0.35741 , -0.611633,
        0.329958, -0.212956, -0.497499, -0.393839, -0.130101, -0.216903,
       -0.105595, -0.076007, -0.483942, -0.139704, -0.161647,  0.136985,
        0.415363, -0.360143,  0.038601, -0.078804, -0.030421,  0.324129,
        0.223378, -0.523636, -0.048317, -0.032248, -0.117367,  0.470519,
        0.225816, -0.222065, -0.225007, -0.165904, -0.334389, -0.20157 ,
        0.572352, -0.268794,  0.301929, -0.005563,  0.387491,  0.261031,
       -0.11613 ,  0.074982, -0.008433,  0.259987, 

We don't need the embeddings since we're going to use our own embedding layer. What we're interested in are the subword tokens and their respective ids. The ids will be used as indexes into our embedding layer.<br>

**BPEmb** places a `▁` marker (it looks like an underscore) in front of any token that is a whole word or that begins a word.<br>

Remember that subword tokenizers are trained using count frequencies over a corpus. So these subword tokens are specific to **BPEmb**. Another subword tokenizer may output something different. This is why it's important that when we use a pretrained model, we make sure to use the pretrained model's tokenizer. 

In [31]:
sample_sentence = "Where can I find a pizzeria?"
tokens = bpemb_en.encode(sample_sentence)
tokens

['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?']
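
Because the `▁` marker records where each word starts, the original (lowercased) text can be recovered from the tokens with plain string operations. The sketch below copies the token list from the output above into a hypothetical variable, `tokens_demo`, so it runs without the library.

```python
tokens_demo = ['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?']

# Glue the pieces together, then turn each word-start marker into a space.
text = "".join(tokens_demo).replace("▁", " ").strip()
print(text)  # where can i find a pizzeria?
```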

In [32]:
# We can retrieve each subword token's respective id using the *encode_ids* method
token_seq = np.array(bpemb_en.encode_ids('Where can I find a pizzeria?'))
token_seq

array([ 571,  280,  386, 1934,    4,   24,  248, 4339,  177, 9967])

Now that we have a way to tokenize and vectorize sentences, we can declare and use an embedding layer with the same vocabulary size as **BPEmb** and a desired embedding size.

In [33]:
token_embed = tf.keras.layers.Embedding(bpemb_vocab_size, embed_dim)
token_embeddings = token_embed(token_seq)

# The untrained embeddings for our sample sentence.
print("Embeddings for: ", sample_sentence)
print(token_embeddings)

Embeddings for:  Where can I find a pizzeria?
tf.Tensor(
[[ 0.01448974 -0.03366766 -0.00832275  0.01385177  0.04119352 -0.00592104
  -0.04922706 -0.0168265   0.03531576  0.02216787 -0.04024782 -0.04667196]
 [-0.04727457  0.01700499  0.0381448  -0.03491439  0.02874904 -0.00526644
   0.02942361  0.00758667 -0.00871487 -0.02891201  0.02335632 -0.02620664]
 [ 0.04148301 -0.03412551 -0.00809424 -0.00841127 -0.03646825  0.03917141
  -0.03420101 -0.02235417  0.03981816 -0.01802616 -0.00026751 -0.00799946]
 [ 0.04622055 -0.04998402  0.020508    0.02776426 -0.00971972  0.02428455
  -0.01700937  0.00117232  0.03432995  0.03076286  0.02077112 -0.03850199]
 [-0.02762603 -0.00742032 -0.02562984  0.02848845 -0.0149024  -0.0446
  -0.03575293  0.04781846  0.01220576 -0.01164117  0.04130632 -0.04780921]
 [ 0.04084361  0.01199125 -0.04959185 -0.0034986   0.01827468  0.00143937
   0.00865041  0.04814276 -0.03797473  0.04411812 -0.01110473  0.03618764]
 [-0.04814401 -0.01833753 -0.00677134  0.03797131  0.
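
Conceptually, an embedding layer is a lookup table: each token id selects one row of a (vocabulary size, embedding size) matrix. The NumPy sketch below illustrates the idea with a randomly filled table. The names and the value range are assumptions chosen to resemble the printed output, not the layer's actual weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size_demo, embed_dim_demo = 10_000, 12   # sizes mirror the example above
table = rng.uniform(-0.05, 0.05, size=(vocab_size_demo, embed_dim_demo))

token_ids = np.array([571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967])
looked_up = table[token_ids]   # one row per token id

print(looked_up.shape)  # (10, 12)
```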

Next, we need to add *positional* information to each token embedding. The original paper used fixed sinusoidal functions, but it's common these days to simply learn another set of embeddings, one per position. We'll do the latter here.<br>

Here, we're declaring an embedding layer with rows equalling a maximum sequence length and columns equalling our token embedding size. We then generate a vector of position ids.

In [27]:
max_seq_len = 256
pos_embed = tf.keras.layers.Embedding(max_seq_len, embed_dim)

# Generate ids for each position of the token sequence.
pos_idx = tf.range(len(token_seq))
pos_idx


We'll use these position ids to index into the positional embedding layer.

In [36]:
# These are our position embeddings
position_embeddings = pos_embed(pos_idx)
print("Position embeddings for the input sequence\n", position_embeddings)

Position embeddings for the input sequence
 tf.Tensor(
[[-1.13660097e-03 -1.60347335e-02  9.01959091e-03 -3.10138948e-02
   1.29749440e-02 -7.72383064e-03  1.52164139e-02 -3.39352638e-02
   7.02438504e-03  6.73501566e-03  2.41510570e-05 -1.55613795e-02]
 [-3.57269123e-03 -2.86044478e-02 -3.23886275e-02 -1.80147588e-04
  -3.89828458e-02 -1.98534876e-03  4.17050458e-02  2.29485370e-02
   1.35246851e-02 -3.60701457e-02 -6.39129430e-04  1.47822164e-02]
 [-7.78924301e-03  4.42704298e-02  1.64815225e-02  1.85513832e-02
  -2.62727737e-02  1.85221471e-02  1.23622529e-02  9.27203894e-03
  -2.87427437e-02 -4.18234617e-04  5.71832061e-05  2.09636204e-02]
 [-1.27251260e-02 -4.71079014e-02 -3.82656343e-02 -3.47413048e-02
   3.87997739e-02  2.04805396e-02  3.33256014e-02  1.59940384e-02
   2.13720463e-02  4.19230796e-02  3.15302126e-02  2.68095471e-02]
 [ 3.52654196e-02 -2.17069387e-02  4.10317667e-02  1.63805820e-02
  -1.80840120e-02 -2.24395879e-02 -2.03658100e-02 -4.42672260e-02
  -9.07685608e-03

The final step is to add our token and position embeddings. The result will be the input to the first encoder block.

In [37]:
input = token_embeddings + position_embeddings
print("Input to the initial encoder block:\n", input)

Input to the initial encoder block:
 tf.Tensor(
[[ 0.01335314 -0.04970239  0.00069684 -0.01716213  0.05416846 -0.01364487
  -0.03401065 -0.05076176  0.04234014  0.02890289 -0.04022367 -0.06223334]
 [-0.05084726 -0.01159946  0.00575617 -0.03509453 -0.01023381 -0.00725179
   0.07112865  0.03053521  0.00480982 -0.06498215  0.02271719 -0.01142442]
 [ 0.03369377  0.01014492  0.00838728  0.01014012 -0.06274102  0.05769356
  -0.02183876 -0.01308214  0.01107541 -0.0184444  -0.00021032  0.01296416]
 [ 0.03349543 -0.09709191 -0.01775763 -0.00697704  0.02908006  0.04476508
   0.01631624  0.01716635  0.055702    0.07268594  0.05230133 -0.01169244]
 [ 0.00763939 -0.02912726  0.01540192  0.04486903 -0.03298641 -0.06703959
  -0.05611874  0.00355123  0.0031289  -0.03732048  0.01345411 -0.04094407]
 [ 0.01998172  0.00828333 -0.01517054 -0.05147992 -0.00461484  0.04957971
  -0.0349167   0.04764981 -0.01404717  0.06276959  0.02993267  0.01714082]
 [-0.09469967  0.01937742  0.00865435  0.03547692  0.02892
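
For comparison, here is a minimal NumPy sketch of the fixed sinusoidal encodings from the original paper, where even dimensions use a sine and odd dimensions a cosine of `pos / 10000^(2i / d_model)`. The function name `sinusoidal_positions` is made up for illustration, and the sizes simply mirror the ones used in this notebook.

```python
import numpy as np

def sinusoidal_positions(max_seq_len, d_model):
    """Fixed (non-learned) positional encodings, one row per position."""
    positions = np.arange(max_seq_len)[:, np.newaxis]   # (max_seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]            # (1, d_model)

    # Dimensions 2i and 2i+1 share the same frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                    # (max_seq_len, d_model)

    encodings = np.zeros((max_seq_len, d_model))
    encodings[:, 0::2] = np.sin(angles[:, 0::2])        # even dimensions
    encodings[:, 1::2] = np.cos(angles[:, 1::2])        # odd dimensions
    return encodings

pos_enc = sinusoidal_positions(max_seq_len=256, d_model=12)
print(pos_enc.shape)
```

Unlike the embedding layer above, these values are not trained; they would be added to the token embeddings in the same way.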

# Encoder

Now that we have an encoder block and a way to embed our tokens with position information, we can create the **encoder** itself.<br>

Given a batch of vectorized sequences, the encoder embeds the tokens, adds positional embeddings, runs the result through its encoder blocks, and returns contextualized token representations.

In [14]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, src_vocab_size, max_seq_len, dropout_rate=0.1):
    super().__init__()

    self.d_model = d_model
    self.max_seq_len = max_seq_len

    self.token_embed = tf.keras.layers.Embedding(src_vocab_size, self.d_model)
    self.pos_embed = tf.keras.layers.Embedding(max_seq_len, self.d_model)

    # The original Attention Is All You Need paper applied dropout to the
    # input before feeding it to the first encoder block
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

    # Create encoder blocks
    self.blocks = [EncoderBlock(self.d_model, num_heads, hidden_dim, dropout_rate)
                   for _ in range(num_blocks)]
  
  def call(self, input, training, mask):
    token_embeds = self.token_embed(input)

    # Generate position indices for a batch of input sequences
    num_pos = input.shape[0] * self.max_seq_len
    pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
    pos_idx = np.reshape(pos_idx, input.shape)
    pos_embeds = self.pos_embed(pos_idx)

    x = self.dropout(token_embeds + pos_embeds, training=training)
    
    # Run input through successive encoder blocks
    for block in self.blocks:
      x, weights = block(x, training, mask)
    
    return x, weights


If you're wondering about this code block here:


```
num_pos = input.shape[0] * self.max_seq_len
pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
pos_idx = np.reshape(pos_idx, input.shape)
pos_embeds = self.pos_embed(pos_idx)
```


This generates positional embeddings for a *batch* of input sequences. Suppose this was our batch of input sequences to the encoder.

In [17]:
# Batch of 3 sequences, each of length 10 (10 is also the 
# maximum sequence length in this case).
seqs = np.random.randint(0, 10000, size=(3, 10))
print(seqs.shape)
print(seqs)

(3, 10)
[[7565 5471 5184 7659 4843 5351 3232 7140 2735 3036]
 [ 637 5675 6435 8349 1770 7142 8985 2513 6068 8600]
 [5482 4930  994  745 5065 2991 1543  804 9863 5955]]


In [18]:
# We need to retrieve a positional embedding for every element in this batch.
# The first step is to create the respective positional ids...
pos_ids = np.resize(np.arange(seqs.shape[1]), seqs.shape[0] * seqs.shape[1])
print(pos_ids)

[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]


In [19]:
# ...and then reshape them to match the input batch dimensions
pos_ids = np.reshape(pos_ids, (3, 10))
print(pos_ids.shape)
print(pos_ids)

(3, 10)
[[0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]]


In [28]:
# We can now retrieve position embeddings for every token embedding
pos_embed(pos_ids)

<tf.Tensor: shape=(3, 10, 12), dtype=float32, numpy=
array([[[-0.04305434, -0.02453849, -0.02423087, -0.00924901,
         -0.0244591 , -0.02212497,  0.02808353, -0.02292987,
          0.02081816,  0.01929775,  0.04014203, -0.03314801],
        [ 0.02030656, -0.0497223 , -0.00140219,  0.01579145,
          0.02903452, -0.02081794, -0.00895287,  0.02823159,
         -0.01854497, -0.00218279,  0.02976627, -0.03864694],
        [-0.01225422, -0.04528918,  0.04851009,  0.01066749,
          0.01510625, -0.00701993, -0.00150501, -0.00867059,
          0.0139587 , -0.02797397, -0.04383787, -0.00570369],
        [-0.03308626,  0.00466092, -0.02672722,  0.03316107,
          0.0490361 , -0.02492866,  0.0132092 , -0.00282953,
          0.04492934,  0.02720095, -0.03943354,  0.00880854],
        [ 0.0128976 , -0.0209771 ,  0.04239047, -0.04588555,
          0.03267778,  0.04865321,  0.00468792, -0.01059624,
         -0.01658805,  0.02663327, -0.01916671,  0.03830277],
        [-0.02320021, -0.04
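
One caveat with the resize-and-reshape approach: it only works when every batch is exactly `max_seq_len` tokens wide, because the reshape needs `batch_size * max_seq_len` elements to fit the input shape. Below is a small NumPy sketch of a variant that derives the ids from the batch's own width instead. The helper name `batch_position_ids` is hypothetical and not part of the encoder above.

```python
import numpy as np

def batch_position_ids(batch):
    """Return position ids 0..seq_len-1 for every sequence in the batch."""
    batch_size, seq_len = batch.shape
    # Repeat the single row [0, 1, ..., seq_len-1] once per sequence.
    return np.tile(np.arange(seq_len), (batch_size, 1))

# A batch that is shorter than a maximum sequence length of 10.
short_seqs = np.random.randint(0, 10000, size=(3, 7))
short_pos_ids = batch_position_ids(short_seqs)
print(short_pos_ids)
```

The ids still have to stay below the number of rows in the positional embedding layer, so the batch width must not exceed the maximum sequence length.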

In [29]:
# Let's try our encoder on a batch of sentences
input_batch = [
    "Where can I find a pizzeria?",
    "Mass hysteria over listeria.",
    "I ain't no circle back girl."
]

bpemb_en.encode(input_batch)

[['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?'],
 ['▁mass', '▁hy', 'ster', 'ia', '▁over', '▁l', 'ister', 'ia', '.'],
 ['▁i', '▁a', 'in', "'", 't', '▁no', '▁circle', '▁back', '▁girl', '.']]

In [30]:
input_seqs = bpemb_en.encode_ids(input_batch)
print("Vectorized inputs:")
input_seqs

Vectorized inputs:


[[571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967],
 [1535, 1354, 1238, 177, 380, 43, 871, 177, 9935],
 [386, 4, 6, 9937, 9915, 467, 5410, 810, 3692, 9935]]

In [31]:
# Note how the input sequences aren't the same length in this batch.
# In this case, we need to pad them out so that they are.
padded_input_seqs = tf.keras.preprocessing.sequence.pad_sequences(input_seqs, 
                                                                  padding='post',
                                                                  truncating='post')
print("Input to the encoder:")
print(padded_input_seqs.shape)
print(padded_input_seqs)

Input to the encoder:
(3, 10)
[[ 571  280  386 1934    4   24  248 4339  177 9967]
 [1535 1354 1238  177  380   43  871  177 9935    0]
 [ 386    4    6 9937 9915  467 5410  810 3692 9935]]


Since our input now has padding, now's a good time to cover **masking**.
<br>

Given a mask, wherever a mask position is set to 0, the corresponding attention score is set to *-inf*. After the softmax, the attention weight for that position is zero, so nothing attends to it.
<br>

We covered *look-ahead* masks for the decoder to prevent it from attending to future tokens, but we also need masks for padding.
<br>

In total, there are three masks involved:
1. The *encoder mask* to mask out any padding in the encoder sequences.

2. The *decoder mask* which is used in the decoder's **first** multi-head self-attention layer. It's a <u>combination of two masks</u>: one to account for the padding in target sequences, and the look-ahead mask.

3. The *memory mask* which is used in the decoder's **second** attention layer, the multi-head cross-attention. The keys and values for this layer are going to be the encoder's output, and this mask will ensure the decoder doesn't attend to any encoder output which corresponds to padding. In practice, 1 and 3 are often the same.

The *scaled_dot_product_attention* function has this line:
```
  if mask is not None:
    scaled_scores = tf.where(mask==0, -np.inf, scaled_scores)
```
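
To see why *-inf* scores turn into zero weights, here is a small NumPy sketch of a masked softmax. It is a stand-in for illustration, not the Keras `Softmax` layer used in the function, and the name `masked_softmax` is made up.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis, ignoring positions where mask == 0."""
    scores = np.where(mask == 0, -np.inf, scores)
    # Subtract the row maximum for numerical stability; exp(-inf) is 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [0.5, 0.5, 0.5]])
mask = np.array([1, 1, 0])  # the last key position is padding

weights = masked_softmax(scores, mask)
print(weights)
```

The masked column receives a weight of exactly zero and the remaining weights in each row still sum to one. In this sketch, a row whose positions are all masked would produce NaN, so at least one position per row needs to stay visible.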

In [32]:
# Let's create an encoder mask for our batch of input sequences.
# Wherever there's padding, we want the mask position set to zero.
enc_mask = tf.cast(tf.math.not_equal(padded_input_seqs, 0), tf.float32)
print("Input:")
print(padded_input_seqs, '\n')
print("Encoder mask:")
print(enc_mask)

Input:
[[ 571  280  386 1934    4   24  248 4339  177 9967]
 [1535 1354 1238  177  380   43  871  177 9935    0]
 [ 386    4    6 9937 9915  467 5410  810 3692 9935]] 

Encoder mask:
tf.Tensor(
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]], shape=(3, 10), dtype=float32)


Keep in mind that the dimension of the attention matrix (for this example) is going to be:<br>
*(batch size, number of heads, query size, key size)*<br>
(3, 3, 10, 10)

In [33]:
# we need to expand the mask dimensions 
enc_mask = enc_mask[:, tf.newaxis, tf.newaxis, :]
enc_mask

<tf.Tensor: shape=(3, 1, 1, 10), dtype=float32, numpy=
array([[[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]],


       [[[1., 1., 1., 1., 1., 1., 1., 1., 1., 0.]]],


       [[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]]], dtype=float32)>

This way, the encoder mask will be *broadcast* across the heads and query positions of the attention matrix.<br>
https://www.tensorflow.org/xla/broadcasting
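
As a quick check of the shapes involved, this NumPy sketch applies a `(3, 1, 1, 10)` mask to a dummy `(3, 3, 10, 10)` score tensor. NumPy follows the same broadcasting rules for this case; the all-zero scores are placeholders, not real attention scores.

```python
import numpy as np

batch_size, num_heads, seq_len = 3, 3, 10

# Placeholder attention scores: (batch, heads, query positions, key positions)
scores = np.zeros((batch_size, num_heads, seq_len, seq_len))

# Padding mask where the last token of the second sequence is padding.
mask = np.ones((batch_size, seq_len))
mask[1, -1] = 0
mask = mask[:, np.newaxis, np.newaxis, :]  # (3, 1, 1, 10)

# The two size-1 axes are stretched over the heads and query positions.
masked_scores = np.where(mask == 0, -np.inf, scores)
print(masked_scores.shape)
```

Every head and every query position of the second sequence ends up with *-inf* in the last key column, while the other two sequences are left untouched.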

In [34]:
# Now we can declare an encoder and pass it batches of vectorized sequences
num_encoder_blocks = 6
d_model = 12  # the embedding dimension used throughout
num_heads = 3

# Feed-forward network hidden dimension width
ffn_hidden_dim = 48

src_vocab_size = bpemb_vocab_size
max_input_seq_len = padded_input_seqs.shape[1]

encoder = Encoder(
    num_encoder_blocks,
    d_model,
    num_heads,
    ffn_hidden_dim,
    src_vocab_size,
    max_input_seq_len
)

In [35]:
# We can now pass our input sequences and mask to the encoder
encoder_output, attn_weights = encoder(padded_input_seqs, training=True, mask=enc_mask)

print(f"Encoder output {encoder_output.shape}:")
print(encoder_output)

Encoder output (3, 10, 12):
tf.Tensor(
[[[ 0.92893213  1.1867573   0.19462499  0.39530596  0.00631281
   -1.6360401  -0.7206076  -1.4299091   0.2929974  -1.35924
    0.90086746  1.239999  ]
  [-0.06947623  1.2353057  -0.2774222   0.81003     0.4535414
   -0.608405   -0.47404522 -1.6943238   0.4667239  -1.4225894
    1.9232383  -0.34257716]
  [ 0.11077741  1.3144714   0.10531245  0.43197262  0.14090824
   -1.4559658  -0.47059813 -1.5199034  -0.2603888  -1.1860625
    1.2969326   1.4925439 ]
  [ 0.9480911   1.3393903  -0.232166    0.6072952  -0.15742984
   -1.2569174  -0.3015812  -1.4833101  -0.02259316 -1.5423868
    1.4935411   0.6080667 ]
  [ 0.8408628   0.90129286 -1.0122358   0.2840256  -0.70243484
   -0.9836674  -0.251804   -0.8127806   0.63793546 -1.5873525
    1.9116813   0.77447706]
  [-0.01615997  2.3178575   0.23885302  0.55538577  0.17812982
   -1.2513822  -0.99101365 -0.82830644 -0.88112456 -0.7975826
    1.230519    0.24482422]
  [ 1.5429044   0.6584737   0.2055447   0.5493

# Decoder Block

Let's build the **Decoder Block**. Everything we did to create the **encoder** block applies here. The major differences are that the **Decoder Block** has:
1. a **Multi-Head Cross-Attention** layer which uses the encoder's outputs as the keys and values.

2. an extra skip/residual connection along with an extra layer normalization step.

<div>
<img src="https://drive.google.com/uc?export=view&id=1WVT4SX49bnta4uscOTF4xrsxFI4PbPER" width="500"/>
</div>
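
Before the implementation, a single-head NumPy sketch of cross-attention may help with the shapes: the queries come from the decoder (8 target positions) while the keys and values come from the encoder (10 source positions), so the attention weights have shape `(8, 10)`. The random inputs and the helper name `attention` are placeholders for illustration only.

```python
import numpy as np

def attention(query, key, value, mask=None):
    """Single-head scaled dot-product attention on 2-D inputs."""
    scores = query @ key.T / np.sqrt(key.shape[-1])
    if mask is not None:
        scores = np.where(mask == 0, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights

rng = np.random.default_rng(0)
decoder_states = rng.random((8, 12))   # 8 target positions, d_model = 12
encoder_output = rng.random((10, 12))  # 10 source positions, d_model = 12
memory_mask = np.array([1] * 9 + [0])  # the last source position is padding

cross_output, cross_weights = attention(
    decoder_states, encoder_output, encoder_output, memory_mask)
print(cross_output.shape, cross_weights.shape)
```

The output keeps the decoder's sequence length, and the mask is defined over the encoder's positions, which is why the encoder padding mask can be reused here.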

In [50]:
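class DecoderBlock(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
    super().__init__()

    # Each sub-layer gets its own attention and normalization layers so that
    # no weights are shared between them.
    self.mhsa1 = MultiHeadSelfAttention(d_model, num_heads)  # masked self-attention
    self.mhsa2 = MultiHeadSelfAttention(d_model, num_heads)  # cross-attention
    self.ffn = feed_forward_network(d_model, hidden_dim)

    # Dropout has no weights, so one layer can be reused.
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

    self.layernorm1 = tf.keras.layers.LayerNormalization()
    self.layernorm2 = tf.keras.layers.LayerNormalization()
    self.layernorm3 = tf.keras.layers.LayerNormalization()

  # Note the decoder block takes two masks: one for the masked self-attention,
  # another for the cross-attention over the encoder output.
  def call(self, encoder_output, target, training, decoder_mask, memory_mask):
    mhsa_output1, attn_weights = self.mhsa1(target, target, target, decoder_mask)
    mhsa_output1 = self.dropout(mhsa_output1, training=training)
    mhsa_output1 = self.layernorm1(mhsa_output1 + target)

    # Queries come from the decoder; keys and values come from the encoder.
    mhsa_output2, attn_weights = self.mhsa2(mhsa_output1, encoder_output, encoder_output, memory_mask) # q, k, v
    mhsa_output2 = self.dropout(mhsa_output2, training=training)
    mhsa_output2 = self.layernorm2(mhsa_output2 + mhsa_output1)

    ffn_output = self.ffn(mhsa_output2)
    ffn_output = self.dropout(ffn_output, training=training)
    output = self.layernorm3(ffn_output + mhsa_output2)

    # The returned weights are those of the cross-attention layer.
    return output, attn_weights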

# Decoder

The decoder is almost the same as the encoder except:
- it takes the encoder's output as part of its input,
- it takes two masks: the decoder mask and memory mask.

In [48]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, target_vocab_size, max_seq_len, dropout_rate=0.1):
    super().__init__()

    self.max_seq_len = max_seq_len

    self.token_embed = tf.keras.layers.Embedding(target_vocab_size, d_model)
    self.pos_embed = tf.keras.layers.Embedding(max_seq_len, d_model)

    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    
    self.blocks = [DecoderBlock(d_model, num_heads, hidden_dim, dropout_rate) 
                   for _ in range(num_blocks)]

  def call(self, encoder_output, target, training, decoder_mask, memory_mask):
    token_embeds = self.token_embed(target)

    # Generate position indices
    num_pos = target.shape[0] * self.max_seq_len
    pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
    pos_idx = np.reshape(pos_idx, target.shape)

    pos_embeds = self.pos_embed(pos_idx)

    x = self.dropout(token_embeds + pos_embeds, training=training)

    for block in self.blocks:
      x, weights = block(encoder_output, x, training, decoder_mask, memory_mask)
    
    return x, weights

Before we try the decoder, let's cover the masks involved. The decoder takes two masks:

The *decoder mask* which is a <u>combination of two masks</u>: one to account for the padding in target sequences, and the look-ahead mask. This mask is used in the decoder's **first** multi-head self-attention layer.

The *memory mask* which is used in the decoder's **second** attention layer, the multi-head cross-attention. The keys and values for this layer are going to be the encoder's output, and this mask will ensure the decoder doesn't attend to any encoder output which corresponds to padding.

In [38]:
# Suppose this is our batch of vectorized target input sequences for the decoder.
# These values are just made up
target_input_seqs = [
    [1, 652, 723, 123, 62],
    [1, 25,  98, 129, 248, 215, 359, 249],
    [1, 2369, 1259, 125, 486],
]

In [39]:
# we need to pad out this batch so that all sequences within it are the same length
padded_target_input_seqs = tf.keras.preprocessing.sequence.pad_sequences(target_input_seqs, padding="post")
print("Padded target inputs to the decoder:")
print(padded_target_input_seqs.shape)
print(padded_target_input_seqs)

Padded target inputs to the decoder:
(3, 8)
[[   1  652  723  123   62    0    0    0]
 [   1   25   98  129  248  215  359  249]
 [   1 2369 1259  125  486    0    0    0]]


In [42]:
# We can create the padding mask the same way we did for the encoder
dec_padding_mask = tf.cast(tf.math.not_equal(padded_target_input_seqs, 0), tf.float32)
dec_padding_mask = dec_padding_mask[:, tf.newaxis, tf.newaxis, :]
print(dec_padding_mask)

tf.Tensor(
[[[[1. 1. 1. 1. 1. 0. 0. 0.]]]


 [[[1. 1. 1. 1. 1. 1. 1. 1.]]]


 [[[1. 1. 1. 1. 1. 0. 0. 0.]]]], shape=(3, 1, 1, 8), dtype=float32)


The look-ahead mask is a lower-triangular matrix: ones on and below the diagonal, zeros above it. This is easy to create using the *band_part* function:<br>
https://www.tensorflow.org/api_docs/python/tf/linalg/band_part

In [44]:
target_input_seq_len = padded_target_input_seqs.shape[1]
look_ahead_mask = tf.linalg.band_part(tf.ones((target_input_seq_len, 
                                               target_input_seq_len)), -1, 0)
print(look_ahead_mask)

tf.Tensor(
[[1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1. 0. 0.]
 [1. 1. 1. 1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1.]], shape=(8, 8), dtype=float32)


To create the decoder mask, we just need to combine the padding and look-ahead masks. Note how the columns of the resulting decoder mask are all zero for padding positions.

In [45]:
dec_mask = tf.minimum(dec_padding_mask, look_ahead_mask)
print("The decoder mask:")
print(dec_mask)

The decoder mask:
tf.Tensor(
[[[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]]]


 [[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 1. 0. 0.]
   [1. 1. 1. 1. 1. 1. 1. 0.]
   [1. 1. 1. 1. 1. 1. 1. 1.]]]


 [[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]]]], shape=(3, 1, 8, 8), dtype=float32)
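
The same combination can be reproduced in plain NumPy as a sanity check. This sketch assumes `np.tril` as a stand-in for `tf.linalg.band_part(x, -1, 0)` (both keep the diagonal and everything below it), and uses the made-up target ids from above.

```python
import numpy as np

padded_targets = np.array([
    [1,  652,  723, 123,  62,   0,   0,   0],
    [1,   25,   98, 129, 248, 215, 359, 249],
    [1, 2369, 1259, 125, 486,   0,   0,   0],
])

# Padding mask: 1 for real tokens, 0 for padding, shaped (batch, 1, 1, seq_len).
padding_mask = (padded_targets != 0).astype(np.float32)
padding_mask = padding_mask[:, np.newaxis, np.newaxis, :]

# Look-ahead mask: lower-triangular matrix of ones, shaped (seq_len, seq_len).
seq_len = padded_targets.shape[1]
look_ahead = np.tril(np.ones((seq_len, seq_len), dtype=np.float32))

# A position is visible only if both masks allow it.
decoder_mask = np.minimum(padding_mask, look_ahead)
print(decoder_mask.shape)
```

Taking the element-wise minimum acts as a logical AND on the two 0/1 masks, and broadcasting expands the result to one `(seq_len, seq_len)` mask per sequence.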


We can now declare a decoder and pass it everything it needs. In our case, the *memory* mask is the same as the *encoder* mask.

In [51]:
# num_blocks, d_model, num_head, hidden_dim, target_vocab_size, max_seq_len
decoder = Decoder(6, 12, 4, 48, 10_000, 8)

# encoder_output, target, training, decoder_mask, memory_mask
decoder_output, _ = decoder(encoder_output, padded_target_input_seqs,
                            True, dec_mask, enc_mask)

print(f"Decoder output {decoder_output.shape}:")
print(decoder_output)

Decoder output (3, 8, 12):
tf.Tensor(
[[[-1.0184541   0.26625583  0.11572924 -1.1485747  -0.952984
    0.8801752  -0.726415    0.8536025  -0.21449092  0.8939749
   -1.0893704   2.140551  ]
  [-0.9775529   0.8717778   0.32277203 -1.5117606   0.04323752
    1.0193734  -1.1903638   1.3521185  -0.19306883  1.0657094
   -1.4271342   0.62489164]
  [-0.30989093  0.7311962   0.5223486  -1.8963698  -0.28913856
    0.06362743 -0.8837329   1.0147457  -0.18362232  0.68921673
   -1.3008653   1.8424853 ]
  [-0.8711018   0.6517269   0.34115294 -1.5863849  -0.3171342
    0.462965   -0.8037953   0.8512697  -0.05725691  0.7320541
   -1.4009932   1.9974973 ]
  [-0.9566183   0.5102372   0.30296198 -1.8084775  -0.21189342
    0.47808024 -0.8638531   0.833414    0.08054852  0.63364947
   -1.0534005   2.0553513 ]
  [-1.0522851   0.70569533  0.04101041 -1.3774719  -0.35612231
    0.22495836 -0.68459797  0.9391469  -0.07410909  0.7755334
   -1.3008506   2.159093  ]
  [-1.5343648   0.9857499   0.5224991  -1.564

# Transformer

We now have all the pieces to build the **Transformer** itself, and it's pretty simple. 

In [52]:
class Transformer(tf.keras.Model):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim,
               source_vocab_size, target_vocab_size,
               max_input_len, max_target_len, dropout_rate=0.1):
    super().__init__()

    self.encoder = Encoder(num_blocks, d_model, num_heads, hidden_dim, 
                           source_vocab_size, max_input_len, dropout_rate)
    self.decoder = Decoder(num_blocks, d_model, num_heads, hidden_dim, 
                           target_vocab_size, max_target_len, dropout_rate)
    
    # The final dense layer to generate logits from the decoder output.
    self.output_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, input_seqs, target_input_seqs, training, encoder_mask,
           decoder_mask, memory_mask):
    
    encoder_output, encoder_attn_weights = self.encoder(input_seqs, training, encoder_mask)
    decoder_output, decoder_attn_weights = self.decoder(encoder_output, 
                                                        target_input_seqs, training,
                                                        decoder_mask, memory_mask)
    
    return self.output_layer(decoder_output), encoder_attn_weights, decoder_attn_weights

In [54]:
transformer = Transformer(
    num_blocks = 6, 
    d_model = 12, 
    num_heads = 3, 
    hidden_dim = 48,
    source_vocab_size = bpemb_vocab_size, 
    target_vocab_size = 7000, # made-up target vocab size.
    max_input_len = padded_input_seqs.shape[1],
    max_target_len = padded_target_input_seqs.shape[1]
)

transformer_output, _, _ = transformer(padded_input_seqs, 
                                       padded_target_input_seqs,
                                       True,
                                       enc_mask,
                                       dec_mask,
                                       memory_mask=enc_mask)

print(f"Transformer output {transformer_output.shape}:")
print(transformer_output) # If training, we would use this output to calculate losses.

Transformer output (3, 8, 7000):
tf.Tensor(
[[[ 0.05081141 -0.01976041 -0.08560605 ... -0.04284935  0.01160289
    0.09640501]
  [ 0.03294214 -0.02145961 -0.08685193 ... -0.0535071  -0.00227882
    0.06122992]
  [-0.02347834 -0.0445135  -0.06411856 ... -0.04939048 -0.00934818
    0.02314242]
  ...
  [-0.00267017 -0.03308434 -0.06233474 ... -0.06370658 -0.01462648
    0.0490629 ]
  [ 0.0129272  -0.01463469 -0.09329622 ... -0.02132286  0.01828239
    0.01303657]
  [-0.03911912 -0.04686867 -0.07727265 ... -0.05215853 -0.01322378
   -0.00205435]]

 [[-0.06971866 -0.0376696  -0.06355062 ... -0.0368159   0.01256143
   -0.07938278]
  [-0.05778301 -0.05025372 -0.04995818 ... -0.05706041  0.00812414
   -0.03802201]
  [-0.02634483 -0.00221239 -0.07389838 ... -0.04657676 -0.04060621
    0.01504142]
  ...
  [-0.03229667 -0.01470375 -0.07107651 ... -0.06028854  0.00238242
   -0.04613585]
  [-0.03567777 -0.05639028 -0.03368648 ... -0.07142429  0.02247666
   -0.02796957]
  [-0.04204955 -0.04404774 -0
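
As a rough idea of how these logits would feed a loss, here is a NumPy sketch of a cross-entropy that ignores padded target positions. It is an illustration under assumptions rather than part of this notebook: the name `masked_cross_entropy`, the padding id of 0, and the all-zero logits are placeholders, and in practice the targets would be the target sequences shifted by one position.

```python
import numpy as np

def masked_cross_entropy(logits, targets, pad_id=0):
    """Mean cross-entropy over non-padding target positions."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Pick the log-probability assigned to each target token.
    token_log_probs = np.take_along_axis(
        log_probs, targets[..., np.newaxis], axis=-1)[..., 0]

    # Zero out the contribution of padded positions.
    mask = (targets != pad_id).astype(log_probs.dtype)
    return -(token_log_probs * mask).sum() / mask.sum()

targets = np.array([
    [1,  652,  723, 123,  62,   0,   0,   0],
    [1,   25,   98, 129, 248, 215, 359, 249],
    [1, 2369, 1259, 125, 486,   0,   0,   0],
])
dummy_logits = np.zeros((3, 8, 7000))  # uniform predictions over 7000 tokens

loss = masked_cross_entropy(dummy_logits, targets)
print(loss)
```

With uniform logits the loss equals the log of the vocabulary size, which is a handy baseline for an untrained model.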

That's the whole original transformer, built from scratch. If you want to train it, remember to use a learning rate warmup (refer to the paper for more information on this).<br><br>
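
For reference, the warmup schedule described in the paper increases the learning rate linearly for the first `warmup_steps` steps and then decays it with the inverse square root of the step number. A minimal sketch in plain Python follows; the defaults of 512 and 4000 are the paper's base settings, not values used in this notebook, and the function name is made up.

```python
def warmup_learning_rate(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (steps start at 1)."""
    step = max(step, 1)  # avoid dividing by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 20000, 100000):
    print(step, warmup_learning_rate(step))
```

The two terms inside `min` are equal at `step == warmup_steps`, which is where the learning rate peaks.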

It's useful to know how these models work under the hood, but training our own transformer to get impressive results is expensive, both in terms of compute and data.<br>

Fortunately, there's a zoo of **pretrained** transformer models we can use. 

**Papers**<br>
Attention Is All You Need (original Transformer paper): https://arxiv.org/abs/1706.03762<br>

The Annotated Transformer: http://nlp.seas.harvard.edu/annotated-transformer/<br>

GPT-3: https://arxiv.org/abs/2005.14165<br>

BERT: https://arxiv.org/abs/1810.04805<br>

RoBERTa: https://arxiv.org/abs/1907.11692<br>

ALBERT: https://arxiv.org/abs/1909.11942<br>

DistilBERT: https://arxiv.org/abs/1910.01108<br>

ELECTRA: https://arxiv.org/abs/2003.10555<br>

XLM: https://arxiv.org/abs/1901.07291<br>