### Scaled dot-product attention

There are several ways to implement a self-attention layer, but the most
common one is scaled dot-product attention, from the paper introducing
the Transformer architecture.³ There are four main steps required to implement this mechanism:


* Project each token embedding into three vectors called query, key, and value.


* Compute attention scores. We determine how much the query and key
vectors relate to each other using a similarity function. As the name
suggests, the similarity function for scaled dot-product attention is
the dot product, computed efficiently using matrix multiplication of the
embeddings. Queries and keys that are similar will have a large
dot product, while those that don’t share much in common
will have little to no overlap. The outputs from this step are called
the attention scores, and for a sequence with n input
tokens there is a corresponding n × n matrix of attention scores.


* Compute attention weights. Dot products can in general produce
arbitrarily large numbers, which can destabilize the training process.
To handle this, the attention scores are first multiplied by a scaling
factor to normalize their variance and then normalized with a softmax to
ensure all the column values[…]

* Update the token embeddings. Once the attention weights are computed,
we multiply them by the value vectors to obtain an updated representation for each embedding.
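
The four steps above can be summarized in a single expression, as given in the Transformer paper. Here $Q$, $K$, and $V$ stack the query, key, and value vectors as rows, $d_k$ is the dimension of the key vectors, and the softmax is taken over the key positions for each query:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$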


In [5]:
from transformers import AutoTokenizer


model_ckpt = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
text = "time flies like an arrow"
# Exclude CLS and SEP special tokens to keep it simple
inputs = tokenizer(text, return_tensors='pt', add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

In [6]:
# Create Dense Embeddings
import torch
from torch import nn 
from transformers import AutoConfig

# Get the config.json file associated with 'bert-base-uncased'
config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb # size: (bert unique_ids x bert embedding_hidden_dim)

Embedding(30522, 768)

In [7]:
# Generate the embeddings 
input_embeds = token_emb(inputs.input_ids)
input_embeds.size() # size: (batch_size x seq_len x hidden_dim)

torch.Size([1, 5, 768])
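
An embedding layer can be read as a learnable lookup table: each token ID selects one row of the weight matrix. The sketch below illustrates this with a toy vocabulary; the sizes and token IDs are assumptions chosen for readability, not BERT's actual values.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes for illustration only (bert-base-uncased uses 30522 x 768)
vocab_size, hidden_size = 10, 4
toy_emb = nn.Embedding(vocab_size, hidden_size)

# One "sentence" of three token IDs, shape (batch_size, seq_len)
toy_ids = torch.tensor([[2, 7, 2]])
toy_embeds = toy_emb(toy_ids)  # shape (1, 3, 4)

# The layer simply indexes rows of its weight matrix
same_as_indexing = torch.equal(toy_embeds, toy_emb.weight[toy_ids])
# Repeated IDs map to identical vectors: no context is involved yet
repeated_match = torch.equal(toy_embeds[0, 0], toy_embeds[0, 2])
print(toy_embeds.shape, same_as_indexing, repeated_match)
```

Because the lookup ignores context, the same token always receives the same vector at this stage; mixing in context is the job of the attention layer.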

In [8]:
from math import sqrt

# Create query, key and value vectors 
query = key = value = input_embeds
dim_k = key.size(-1)

# Create a 5x5 matrix of attention per sample in the batch 
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
scores.size()

torch.Size([1, 5, 5])
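
To see why the scores are divided by `sqrt(dim_k)`, consider the idealized case where the components of the query and key vectors are independent with zero mean and unit variance. Their dot product then has a variance of roughly `dim_k`, and the division brings it back to roughly 1. The sketch below checks this empirically; the vector size and sample count are arbitrary assumptions.

```python
import torch
from math import sqrt

torch.manual_seed(0)

dim_k = 64          # assumed vector size for this illustration
num_pairs = 10_000  # number of random query/key pairs to sample

# Random queries and keys with zero mean and unit variance per component
q = torch.randn(num_pairs, dim_k)
k = torch.randn(num_pairs, dim_k)

raw_scores = (q * k).sum(dim=-1)          # one dot product per pair
scaled_scores = raw_scores / sqrt(dim_k)  # same scaling as in the cell above

raw_var = raw_scores.var().item()         # close to dim_k
scaled_var = scaled_scores.var().item()   # close to 1
print(f"raw variance: {raw_var:.1f}, scaled variance: {scaled_var:.2f}")
```

Without the scaling, larger vector sizes would push the softmax toward near one-hot outputs, which is one way the training instability mentioned above can arise.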

In [9]:
import torch.nn.functional as F 
weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

In [10]:
# Multiply the attention weights by the values
attn_outputs = torch.bmm(weights, value)
attn_outputs.size()

torch.Size([1, 5, 768])

In [11]:
def scaled_dot_product_attention(query, key, value): 
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k) 
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)
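
As a sanity check, the function can be compared against PyTorch's built-in `F.scaled_dot_product_attention`. This sketch assumes PyTorch 2.0 or later, where that function is available; the tensor sizes are arbitrary, and `sdpa_reference` is a hypothetical name for a copy of the function above, repeated so the block runs on its own.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)

def sdpa_reference(query, key, value):
    # Same computation as the function above, repeated so this block runs alone
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(query.size(-1))
    return torch.bmm(F.softmax(scores, dim=-1), value)

# Toy tensors: (batch_size, seq_len, head_dim), sizes chosen arbitrarily
q, k, v = (torch.randn(2, 5, 8) for _ in range(3))

ours = sdpa_reference(q, k, v)
# Built-in equivalent; assumes PyTorch 2.0 or later
builtin = F.scaled_dot_product_attention(q, k, v)
matches = torch.allclose(ours, builtin, atol=1e-5)
print(ours.shape, matches)
```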

### Multi-head Attention

In our simple example, we only used the embeddings “as is” to compute
the attention scores and weights, but that’s far from the
whole story. In practice, the self-attention layer applies three
independent linear transformations to each embedding to generate the
query, key, and value vectors. These transformations project the
embeddings and each projection carries its own set of learnable
parameters, which allows the self-attention layer to focus on different
semantic aspects of the sequence.

It also turns out to be beneficial to have multiple sets of linear
projections, each one representing a so-called attention head. The
resulting multi-head attention layer is illustrated in
Figure 3-5. But why do we need more than one
attention head? The reason is that the softmax of one head tends to
focus on mostly one aspect of similarity. Having several heads allows
the model to focus on several aspects at once. For instance, one head
can focus on subject-verb interaction, whereas another finds nearby
adjectives. Obviously we don’t handcraft these relations
into the model, and they are fully learned from the data. If you are
familiar with computer vision models you might see the resemblance to filters in
convolutional neural networks, where one filter can be responsible for
detecting faces and another one finds wheels of cars in images.


In [17]:
class AttentionHead(nn.Module): 
    def __init__(self, embed_dim, head_dim) -> None:
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state) -> torch.Tensor: 
        return scaled_dot_product_attention(
            self.q(hidden_state), 
            self.k(hidden_state), 
            self.v(hidden_state)
        )


class MultiHeadAttention(nn.Module): 
    def __init__(self, config) -> None:
        """
        config: BERT configuration loaded from the transformers config.json file
        """
        super().__init__()
        embed_dim = config.hidden_size             # 768 
        num_heads = config.num_attention_heads     # 12 
        head_dim = embed_dim // num_heads          # 64 
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) 
                for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state): 
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        return self.output_linear(x) # size: (batch_size, seq_len, hidden_dim)


In [18]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(input_embeds)
attn_output.size()


torch.Size([1, 5, 768])
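
For comparison, PyTorch ships its own `nn.MultiheadAttention` module. The sketch below uses it with the same BERT-base sizes and a random tensor standing in for `input_embeds`. It is not the implementation above: among other details, it computes all heads with batched projections rather than a list of separate head modules. Passing `average_attn_weights=False` (assumed available, PyTorch 1.11 or later) returns one attention matrix per head.

```python
import torch
from torch import nn

torch.manual_seed(0)

embed_dim, num_heads = 768, 12   # BERT-base sizes used in this section
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, 5, embed_dim)  # random stand-in for input_embeds
# Self-attention: the same tensor serves as query, key, and value.
# average_attn_weights=False keeps one weight matrix per head.
out, per_head_weights = mha(x, x, x, average_attn_weights=False)

print(out.shape)               # (batch_size, seq_len, embed_dim)
print(per_head_weights.shape)  # (batch_size, num_heads, seq_len, seq_len)
```

Each of the 12 heads produces its own 5 × 5 matrix of attention weights, which is what lets different heads attend to different aspects of the sequence.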

### The Feed-Forward Layer

The feed-forward sublayer in the encoder and decoder is just a simple
two-layer fully connected neural network, but with a twist: instead of
processing the whole sequence of embeddings as a single vector, it
processes each embedding independently. For this reason, this layer is
often referred to as a position-wise feed-forward layer. You may also
see it referred to as a one-dimensional convolution with a kernel size
of one, typically by people with a computer vision background (e.g., the
OpenAI GPT codebase uses this nomenclature). A rule of thumb from the
literature is for the hidden size of the first layer to be four times
the size of the embeddings, and a GELU activation function is most
commonly used. This is where most of the capacity and memorization is
hypothesized to happen, and it’s the part that is most often
scaled when scaling up the models. We can implement this as a simple
nn.Module as follows:

In [19]:
class FeedForward(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.hidden_size, config.intermediate_size),
            nn.GELU(),  # non-linearity between the two linear layers
            nn.Linear(config.intermediate_size, config.hidden_size),
            nn.Dropout(config.hidden_dropout_prob)
        )

    def forward(self, x): 
        return self.net(x)


In [20]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_output)
ff_outputs.size()

torch.Size([1, 5, 768])
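
The "position-wise" behavior can be checked directly. The sketch below uses toy sizes and leaves out dropout so that the comparison is deterministic (both are assumptions for illustration). It shows that running the network on the whole sequence gives the same result as running it on each position separately, and that a linear layer matches a one-dimensional convolution with a kernel size of one when the two share weights.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size, intermediate_size = 8, 32   # toy sizes with a 4x expansion
ffn = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),
    nn.GELU(),
    nn.Linear(intermediate_size, hidden_size),
)

x = torch.randn(1, 5, hidden_size)       # (batch_size, seq_len, hidden_size)

with torch.no_grad():
    # Whole sequence at once
    full = ffn(x)
    # One position at a time, then stacked back along the sequence axis
    per_position = torch.stack([ffn(x[:, i]) for i in range(x.size(1))], dim=1)

    # A kernel-size-1 convolution sharing weights with the first linear layer
    linear = ffn[0]
    conv = nn.Conv1d(hidden_size, intermediate_size, kernel_size=1)
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)
    # Conv1d expects (batch_size, channels, seq_len), so transpose in and out
    conv_out = conv(x.transpose(1, 2)).transpose(1, 2)
    linear_out = linear(x)

position_wise = torch.allclose(full, per_position, atol=1e-5)
conv_equivalent = torch.allclose(conv_out, linear_out, atol=1e-5)
print(position_wise, conv_equivalent)
```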

### Adding Layer Normalization

As mentioned earlier, the Transformer architecture makes use of layer
normalization and skip connections. The former normalizes each input
in the batch to have zero mean and unit variance. Skip connections pass
a tensor to the next layer of the model without processing and add it to
the processed tensor. When it comes to placing the layer normalization
in the encoder or decoder layers of a transformer, there are two main
choices adopted in the literature:

##### Post layer normalization

  This is the arrangement used in the
Transformer paper; it places layer normalization in between the skip
connections. This arrangement is tricky to train from scratch as the
gradients can diverge. For this reason, you will often see a concept
known as learning rate warm-up, where the learning rate is gradually
increased from a small value to some maximum value during
training.

##### Pre layer normalization

  This is the most common arrangement
found in the literature; it places layer normalization within the span
of the skip connections. This tends to be much more stable during
training, and it does not usually require any learning rate
warm-up.
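
The learning rate warm-up mentioned for post layer normalization can be sketched as a small function. The text does not specify the shape of the ramp, so the linear ramp below is an assumption (a common choice), and `warmup_lr` along with its settings is a hypothetical name chosen for illustration. Any decay applied after the warm-up phase is left out.

```python
def warmup_lr(step: int, max_lr: float, warmup_steps: int) -> float:
    """Linear warm-up: ramp from max_lr / warmup_steps up to max_lr, then hold."""
    # step is 0-based; once warmup_steps updates have passed, the rate stays at max_lr
    return max_lr * min(1.0, (step + 1) / warmup_steps)

# Hypothetical settings, chosen only for illustration
max_lr, warmup_steps = 1e-3, 100
schedule = [warmup_lr(s, max_lr, warmup_steps) for s in (0, 49, 99, 500)]
print(schedule)
```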


In [21]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x): 
        # Apply layer normalization and then copy input into query, key and value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x 

In [22]:
encoded_layer = TransformerEncoderLayer(config)
input_embeds.shape, encoded_layer(input_embeds).size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))
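
The layer above uses the pre layer normalization arrangement. To make the contrast with post layer normalization concrete, the sketch below wraps a generic sublayer in both ways. `PostLNSublayer` and `PreLNSublayer` are hypothetical helper names, and a plain linear layer stands in for the attention or feed-forward sublayer; the sizes are toy values.

```python
import torch
from torch import nn

torch.manual_seed(0)

class PostLNSublayer(nn.Module):
    """Hypothetical wrapper: add the skip connection, then normalize."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNSublayer(nn.Module):
    """Hypothetical wrapper: normalize inside the span of the skip connection."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

hidden_size = 16                                 # toy size
x = 3.0 + 2.0 * torch.randn(1, 5, hidden_size)   # inputs with mean 3, std 2

# A plain linear layer stands in for the attention or feed-forward sublayer
post = PostLNSublayer(hidden_size, nn.Linear(hidden_size, hidden_size))
pre = PreLNSublayer(hidden_size, nn.Linear(hidden_size, hidden_size))

post_out, pre_out = post(x), pre(x)
# Post-LN output is normalized per token: mean near 0, variance near 1
post_mean = post_out.mean(dim=-1).abs().max().item()
post_var = post_out.var(dim=-1, unbiased=False).mean().item()
print(post_mean, post_var, pre_out.shape)
```

In the post-LN wrapper the normalization sits on the main path, so every output token is re-normalized. In the pre-LN wrapper the skip path carries the input through untouched, which is the property usually credited for its more stable training.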