# Transformer - Masking and Multihead Attention

In [1]:
import tensorflow as tf
import numpy as np

## Masking
There are two types of masks.
1. padding mask \
Masks all the pad tokens in a batch of sequences, so that the model does not treat padding as input. The mask marks where the pad value 0 is present: it is 1 at those positions and 0 everywhere else.

1. look-ahead mask \
Masks the future tokens in a sequence. In other words, the mask indicates which entries should not be used.

In [3]:
def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # Add extra dimensions so the mask broadcasts
    # against the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

In [6]:
x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
create_padding_mask(x)

<tf.Tensor: shape=(3, 1, 1, 5), dtype=float32, numpy=
array([[[[0., 0., 1., 1., 0.]]],


       [[[0., 0., 0., 1., 1.]]],


       [[[1., 1., 1., 0., 0.]]]], dtype=float32)>
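To see why the mask has shape (batch_size, 1, 1, seq_len), here is a minimal sketch of how it might be applied. It assumes the common convention of adding `mask * -1e9` to the attention logits before the softmax; the `logits` tensor and its shape are made up for illustration.

```python
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 where the token id is 0 (padding), 0.0 elsewhere.
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

seq = tf.constant([[7, 6, 0, 0, 1]])
mask = create_padding_mask(seq)

# Hypothetical attention logits: (batch_size, num_heads, seq_len_q, seq_len_k).
logits = tf.zeros((1, 2, 5, 5))

# Assumption: masked positions get a large negative value added,
# so the softmax assigns them (almost) zero weight.
masked_logits = logits + mask * -1e9
weights = tf.nn.softmax(masked_logits, axis=-1)

print(weights[0, 0, 0].numpy())
```

With all-zero logits, the three non-pad positions share the weight equally (about 1/3 each) and the two pad positions get 0. The two singleton dimensions let the same mask broadcast over every head and every query position.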

### look-ahead mask
The look-ahead mask is used to mask the future tokens in a sequence. In other words, the mask indicates which entries should not be used.

This means that to predict the third token, only the first and second tokens will be used. Similarly, to predict the fourth token, only the first, second, and third tokens will be used, and so on.

In [9]:
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)


In [13]:
x = tf.random.uniform((1, 3))
temp = create_look_ahead_mask(x.shape[1])  # x.shape[1] is 3 here
temp

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0., 1., 1.],
       [0., 0., 1.],
       [0., 0., 0.]], dtype=float32)>
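In a decoder, both masks usually have to hold at the same time. A minimal sketch, assuming the two masks are merged with an element-wise maximum (a position is masked if either mask says so). The name `combined_mask` is only for illustration.

```python
import tensorflow as tf

def create_padding_mask(seq):
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

def create_look_ahead_mask(size):
    # Upper triangle above the diagonal is 1 (future positions).
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

seq = tf.constant([[7, 6, 0, 0, 1]])
padding_mask = create_padding_mask(seq)                  # (1, 1, 1, 5)
look_ahead_mask = create_look_ahead_mask(seq.shape[1])   # (5, 5)

# Assumption: an entry is masked if it is padding OR lies in the future.
combined_mask = tf.maximum(padding_mask, look_ahead_mask)  # (1, 1, 5, 5)

print(combined_mask[0, 0].numpy())
```

The first row masks everything except the first token, while the last row masks only the two pad positions, because nothing is in the future for the last position.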

## Multihead Attention - Step by Step 

Multi-head attention consists of four parts:
- Linear layers and split into heads.
- Scaled dot-product attention.
- Concatenation of heads.
- Final linear layer.

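Each multi-head attention block gets three inputs:
1. Q (query) - What we are looking for, like the search query typed into Google.
1. K (key) - What each candidate is matched on, like the results the query is compared against.
1. V (value) - The content attached to each key, which is returned weighted by how well that key matches the query.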

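$$
\alpha = Q K^{T}
$$
Attention score $\alpha$: how strongly each query is related to each key, computed as the dot product of every query vector with every key vector.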
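This notebook refers to a `scaled_dot_product_attention` function without showing it, so here is a minimal sketch under the usual formulation: scores are divided by the square root of the key depth, optionally masked, passed through a softmax, and used to weight V. The toy `temp_q`, `temp_k`, `temp_v` tensors are made up for illustration.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # Raw scores: dot product of every query with every key.
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # Scale by sqrt(d_k) so the softmax does not saturate for large depths.
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)

    # Assumption: mask uses 1 for "do not attend", as in the masks above.
    if mask is not None:
        scaled_logits += mask * -1e9

    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights

temp_k = tf.constant([[10., 0., 0.],
                      [0., 10., 0.],
                      [0., 0., 10.],
                      [0., 0., 10.]])   # (4, 3)
temp_v = tf.constant([[1., 0.],
                      [10., 0.],
                      [100., 5.],
                      [1000., 6.]])     # (4, 2)
temp_q = tf.constant([[0., 10., 0.]])   # (1, 3), aligned with the second key

output, weights = scaled_dot_product_attention(temp_q, temp_k, temp_v)
print(weights.numpy())
print(output.numpy())
```

The query lines up with the second key, so almost all the weight lands there and the output is approximately the second value, `[10, 0]`.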

Q, K, and V are each put through a linear (Dense) layer before the multi-head attention function. \
For simplicity and efficiency, the code below uses one Dense layer with d_model outputs per input instead of num_heads separate, smaller layers.
- The output is rearranged to a shape of (batch, **num_heads**, ...) before applying the attention function.

The scaled_dot_product_attention function defined above is applied in a single call, broadcast over all heads for efficiency. An appropriate mask must be used in the attention step. The attention outputs of the heads are then concatenated (using tf.transpose and tf.reshape) and put through a final Dense layer.
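The rearranging and concatenation steps can be sketched with tf.reshape and tf.transpose alone. The helper names `split_heads` and `combine_heads` are chosen here for illustration and are assumptions, not names taken from this notebook.

```python
import tensorflow as tf

def split_heads(x, num_heads):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
    batch_size = tf.shape(x)[0]
    depth = x.shape[-1] // num_heads
    x = tf.reshape(x, (batch_size, -1, num_heads, depth))
    return tf.transpose(x, perm=[0, 2, 1, 3])

def combine_heads(x):
    # (batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)
    batch_size = tf.shape(x)[0]
    x = tf.transpose(x, perm=[0, 2, 1, 3])  # (batch, seq_len, num_heads, depth)
    d_model = x.shape[2] * x.shape[3]
    return tf.reshape(x, (batch_size, -1, d_model))

x = tf.random.uniform((2, 6, 8))     # batch=2, seq_len=6, d_model=8
heads = split_heads(x, num_heads=4)  # depth = 8 // 4 = 2
restored = combine_heads(heads)

print(heads.shape)
print(restored.shape)
```

Splitting and then combining returns the original tensor unchanged, which shows that the heads are just different slices of the same d_model-sized vector.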

Instead of one single attention head, Q, K, and V are split into multiple heads because this allows the model to jointly attend to information from different representation subspaces at different positions. After the split, each head has a reduced dimensionality (d_model / num_heads), so the total computation cost is the same as single-head attention with full dimensionality.

In [None]:
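# Illustrative values (assumptions, chosen only so that the cell runs)
d_model = 512
q = k = v = tf.random.uniform((1, 60, d_model))  # (batch_size, seq_len, d_model)

# Create the projection layers wq, wk, wv
wq = tf.keras.layers.Dense(d_model)
wk = tf.keras.layers.Dense(d_model)
wv = tf.keras.layers.Dense(d_model)

# Calculate q, k, v
q = wq(q)  # (batch_size, seq_len, d_model)
k = wk(k)  # (batch_size, seq_len, d_model)
v = wv(v)  # (batch_size, seq_len, d_model)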
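Putting the four parts together, here is a minimal end-to-end sketch for self-attention without a mask. The sizes (`batch_size`, `seq_len`, `d_model`, `num_heads`) and the input `x` are illustrative assumptions, and the attention step is written inline using the standard scaled dot-product formulation.

```python
import tensorflow as tf

# Illustrative sizes (assumptions).
batch_size, seq_len, d_model, num_heads = 1, 60, 512, 8
depth = d_model // num_heads  # dimensionality of each head

# 1. Linear layers (one Dense per input) and the final linear layer.
wq = tf.keras.layers.Dense(d_model)
wk = tf.keras.layers.Dense(d_model)
wv = tf.keras.layers.Dense(d_model)
dense = tf.keras.layers.Dense(d_model)

x = tf.random.uniform((batch_size, seq_len, d_model))  # self-attention: q = k = v = x

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
    t = tf.reshape(t, (batch_size, -1, num_heads, depth))
    return tf.transpose(t, perm=[0, 2, 1, 3])

q = split_heads(wq(x))
k = split_heads(wk(x))
v = split_heads(wv(x))

# 2. Scaled dot-product attention, applied to all heads in one call.
logits = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(depth, tf.float32))
weights = tf.nn.softmax(logits, axis=-1)  # (batch, num_heads, seq_len, seq_len)
heads = tf.matmul(weights, v)             # (batch, num_heads, seq_len, depth)

# 3. Concatenate the heads back into d_model.
concat = tf.transpose(heads, perm=[0, 2, 1, 3])
concat = tf.reshape(concat, (batch_size, -1, d_model))

# 4. Final linear layer.
output = dense(concat)

print(output.shape)
print(weights.shape)
```

The output keeps the input shape (batch_size, seq_len, d_model), while the attention weights have one (seq_len, seq_len) matrix per head.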

