`transformers` is a Hugging Face package that makes it easy to load, train, and create models.

In [None]:
!pip install transformers


`bertviz` is a package for visualizing the sublayers of a BERT model.

In [None]:
!pip install bertviz


Make sure you click the `RESTART RUNTIME` button so that the newly installed packages can be used in the runtime.

# 1. Visualizing BERT sublayers

In [None]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "I am a machine learning engineer who is currently working on some big NLP projects"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)


# 2. Components inside the Transformer

## 2.1. Scaled dot-product attention

The first thing we need to do is tokenize the input text into a list of indices with the tokenizer.

In [2]:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 1045,  2572,  1037,  3698,  4083,  3992,  2040,  2003,  2747,  2551,
          2006,  2070,  2502, 17953,  2361,  3934]])

In [3]:
text

'I am a machine learning engineer who is currently working on some big NLP projects'

Each index in the list of input indices maps to a unique token in the vocabulary. In the next step, these indices are projected into a new feature space, where each one is represented by an embedding vector. This transformation is performed by a `torch.nn.Embedding` layer, which acts as a look-up table for each index.

In [4]:
import torch.nn as nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)
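
To make the look-up-table behaviour concrete, here is a minimal sketch with a toy vocabulary. The sizes (10 tokens, 4 dimensions), the token indices, and the names `toy_emb` and `toy_ids` are made up for illustration and are not BERT's.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only (BERT-base uses 30522 x 768).
toy_emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

# A batch with one sequence of three token indices; index 2 appears twice.
toy_ids = torch.tensor([[2, 7, 2]])
vectors = toy_emb(toy_ids)  # shape: (1, 3, 4)

# The layer returns the rows of its weight matrix selected by the indices.
same_as_rows = torch.equal(vectors[0], toy_emb.weight[toy_ids[0]])
print(vectors.shape, same_as_rows)
```

Because the layer is a pure look-up, the two occurrences of index 2 receive identical vectors.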

The BERT base model maps each token to a 768-dimensional vector. Passing `inputs.input_ids` through `token_emb` gives the embedding matrix of the whole sequence, with shape `(batch_size, seq_length, embedding_size)`.

In [5]:
input_embs = token_emb(inputs.input_ids)
input_embs.shape

torch.Size([1, 16, 768])

Next, we calculate self-attention with the `scaled_dot_product_attention()` function:

![](https://imgur.com/3CVYGDi.png)

Figure 1: Scaled dot-product attention mechanism.
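
Written as a formula, with $Q$, $K$, and $V$ denoting the query, key, and value matrices and $d_k$ the dimension of the key vectors, the mechanism in Figure 1 is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The softmax is taken row by row, so each token's attention weights over the sequence sum to 1.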

In [6]:
import torch
import torch.nn.functional as F
from math import sqrt

def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

`torch.bmm()` computes a batch matrix multiplication. The batch dimension is left untouched, and the two matrices are multiplied over the remaining two dimensions. In this case `weights` has shape `(batch, seq_length, seq_length)` and `value` has shape `(batch, seq_length, hidden_size)`. For each batch element, a `(seq_length, seq_length)` matrix is multiplied by a `(seq_length, hidden_size)` matrix to produce `(seq_length, hidden_size)`, so the final output has shape `(batch, seq_length, hidden_size)`.
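
The shape rule can be checked on a small example. This is a sketch with toy sizes chosen for illustration; it compares `torch.bmm()` against an ordinary matrix product applied to each batch element separately.

```python
import torch

torch.manual_seed(0)
batch, seq_length, hidden_size = 2, 5, 8  # toy sizes, for illustration only

weights = torch.rand(batch, seq_length, seq_length)
value = torch.rand(batch, seq_length, hidden_size)

out = torch.bmm(weights, value)  # (batch, seq_length, hidden_size)

# Same result computed one batch element at a time with a plain matrix product.
manual = torch.stack([weights[b] @ value[b] for b in range(batch)])
print(out.shape, torch.allclose(out, manual, atol=1e-6))
```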

In [None]:
query = key = value = input_embs
weighted_value = scaled_dot_product_attention(query, key, value)
weighted_value.shape

## 2.2. Multi-head Attention

The attention weights and the value vectors are combined to produce the output of each self-attention layer. That is not the whole story of attention, though. We also run self-attention several times in parallel, which helps the model learn different aspects of the sequence. Because these computations are independent of one another, they can run at the same time, which makes training and inference faster on parallel hardware such as GPUs.

![](https://imgur.com/D6mLEJW.png)

Figure 2: Multi-head attention architecture.

We treat each attention output, which is a weighted combination of value vectors, as a head, so the collection of output vectors is called the multi-head attention output. The heads are concatenated and passed through another linear projection to produce an output with the same shape as the sublayer's input. This guarantees that sublayers can be stacked in a deep sequence without shape errors.

First, we wrap self-attention in an `nn.Module` named `AttentionHead` so that it can be packaged and reused later.

In [None]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs

We then use the `AttentionHead` class to create multiple heads, concatenate their outputs, and apply the linear projection.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size # 768
        num_heads = config.num_attention_heads # 12
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

In [None]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(input_embs)
attn_output.size()
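
The `MultiHeadAttention` class above loops over its heads one by one. A common alternative, sketched here under the assumption of BERT-base sizes (768 hidden units, 12 heads), computes all heads in a single batched matrix product by reshaping the projections. The names `q_proj`, `k_proj`, `v_proj`, `out_proj`, and `split_heads` are illustrative and not part of this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)
batch, seq_len = 1, 16
embed_dim, num_heads = 768, 12          # assumed BERT-base sizes
head_dim = embed_dim // num_heads       # 64

# One projection per role covers all heads at once (illustrative names).
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)
out_proj = nn.Linear(embed_dim, embed_dim)

def split_heads(t):
    # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
    return t.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

x = torch.randn(batch, seq_len, embed_dim)
q, k, v = split_heads(q_proj(x)), split_heads(k_proj(x)), split_heads(v_proj(x))

# All heads are scored in a single batched matrix product.
scores = q @ k.transpose(-2, -1) / sqrt(head_dim)   # (batch, num_heads, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)
context = weights @ v                               # (batch, num_heads, seq_len, head_dim)

# Concatenate the heads back into one vector per token, then project.
merged = context.transpose(1, 2).reshape(batch, seq_len, embed_dim)
out = out_proj(merged)
print(weights.shape, out.shape)
```

Either way, 12 heads of 64 dimensions concatenate back to 768, which is why the output shape matches the input shape.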

## 2.3. Feed forward layer

The feed-forward block consists of two fully connected layers placed after multi-head attention to complete a Transformer sublayer. They are wrapped in an `nn.Module` like this:

In [None]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

In [None]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_output)
ff_outputs.size()
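
One property worth noting is that the feed-forward layers act on each token independently, because `nn.Linear` only transforms the last dimension. The sketch below illustrates this with toy sizes and a stand-in network named `ffn` (dropout is left out so the comparison is deterministic); these names and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, intermediate_size = 8, 32  # toy sizes; BERT-base uses 768 and 3072

ffn = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),
    nn.GELU(),
    nn.Linear(intermediate_size, hidden_size),
)

x = torch.randn(1, 5, hidden_size)  # (batch, seq_length, hidden_size)
full = ffn(x)

# Apply the same network to each token on its own and stack the results.
per_token = torch.stack([ffn(x[:, i, :]) for i in range(x.size(1))], dim=1)
print(full.shape, torch.allclose(full, per_token, atol=1e-6))
```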

## 2.4. A sublayer


Experiments show that models converge faster and get closer to the optimum when normalization is interleaved between the `Multi-head attention` layer and the `Feed Forward` layer. There are two ways to apply normalization:

* Post layer norm: The norm layers are applied after the multi-head attention layers and sit outside the skip connection.

* Pre layer norm: The norm layers are placed right before the multi-head attention layers and sit within the span of the skip connection.

![](https://imgur.com/b2hrwmi.png)

Figure 3: Post layer norm

![](https://imgur.com/fbSsI2F.png)

Figure 4: Pre layer norm
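
The difference between the two arrangements comes down to where the normalization sits relative to the residual sum. Here is a minimal sketch with toy sizes, using a single `nn.Linear` named `sublayer` as a stand-in for attention or feed-forward; the function names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 8  # toy size, for illustration only

norm = nn.LayerNorm(hidden_size)
sublayer = nn.Linear(hidden_size, hidden_size)  # stand-in for attention or feed-forward

def post_layer_norm(x):
    # Normalize after the residual sum: the norm sits outside the skip connection.
    return norm(x + sublayer(x))

def pre_layer_norm(x):
    # Normalize only the branch input: the skip path carries x through untouched.
    return x + sublayer(norm(x))

x = torch.randn(1, 5, hidden_size)
post_out, pre_out = post_layer_norm(x), pre_layer_norm(x)
print(post_out.shape, pre_out.shape)
```

Both variants preserve the input shape. Only the post layer norm output is itself normalized, since in the pre layer norm variant the raw input is added back after the norm.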

Below we apply `pre layer norm`.

In [None]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [None]:
encoder_layer = TransformerEncoderLayer(config)
encoder_layer(input_embs).size()

Note that the output shape of the whole sublayer is the same as its input shape.

## 2.5. Positional Embedding

Positional embeddings are based on a simple, yet very effective idea: augment the token embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is characteristic for each position, the attention heads and feed-forward layers in each stack can learn to incorporate positional information in their transformations.

There are several ways to achieve this and one of the most popular approaches, especially when the pretraining dataset is sufficiently large, is to use a learnable pattern. This works exactly the same way as the token embeddings but using the position index instead of the token ID as input. With that approach an efficient way of encoding the position of tokens is learned during pretraining.

In [None]:
class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size,
                                             config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                               config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()

    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
        # create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [None]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()
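
To see why positional information has to be added explicitly, it helps to check that plain self-attention ignores token order. The sketch below uses toy sizes and a helper named `self_attention` (an illustrative name, without learned projections) to show that shuffling the input tokens merely shuffles the outputs in the same way.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)

def self_attention(x):
    # Plain scaled dot-product self-attention with query = key = value = x.
    scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
    return torch.bmm(F.softmax(scores, dim=-1), x)

x = torch.randn(1, 6, 8)   # toy (batch, seq_length, hidden_size)
perm = torch.randperm(6)   # a random reordering of the six positions

# Shuffling the tokens first gives the same vectors, just in shuffled order,
# so attention alone cannot tell which token came first.
shuffled_then_attended = self_attention(x[:, perm, :])
attended_then_shuffled = self_attention(x)[:, perm, :]
print(torch.allclose(shuffled_then_attended, attended_then_shuffled, atol=1e-5))
```

Adding a position-dependent vector to each token embedding breaks this symmetry, which is what the `Embeddings` module does.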

# 3. Full Encoder

We now have all the modules needed to build a complete encoder. In the next step, we combine them into a pipeline that first applies the token and positional embeddings and then passes the result through a stack of sublayers.

In [None]:
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [None]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

# 4. Bodies and Heads

Now that we have a full transformer encoder, we would like to build a classifier with it. A model is usually divided into a task-independent body and a task-specific head. What we’ve built so far is the body, so we now need to attach a classification head to it. We simply add a linear projection:

In [None]:
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.encoder(x)[:, 0, :]
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [None]:
config.num_labels = 2
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()
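
The head itself is small enough to inspect in isolation. In this sketch a random tensor named `hidden_states` stands in for the encoder output (an assumption, to avoid building the full model), and a softmax is added at the end to turn the logits into class probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, num_labels = 768, 2

# Stand-in for the encoder output: (batch, seq_length, hidden_size).
hidden_states = torch.randn(3, 16, hidden_size)

classifier = nn.Linear(hidden_size, num_labels)
first_token = hidden_states[:, 0, :]   # (batch, hidden_size)
logits = classifier(first_token)       # (batch, num_labels)
probs = F.softmax(logits, dim=-1)      # each row sums to 1
print(logits.shape, probs.sum(dim=-1))
```

In BERT the first position normally holds the `[CLS]` token. Since this notebook tokenizes with `add_special_tokens=False`, position 0 here is simply the first token of the text.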

# 5. Transformer Decoder

The decoder has two attention sublayers:

**Masked multi-head attention:** Ensures that the tokens we generate at each timestep are only based on the past outputs and the current token being predicted. Without this, the decoder could cheat during training by simply copying the target translations, so masking the inputs ensures the task is not trivial.

**Encoder-decoder attention:** Performs multi-head attention over the output key and value vectors of the encoder stack, with the intermediate representation of the decoder acting as the queries. This way the encoder-decoder attention layer learns how to relate tokens from two different sequences such as two different languages.

![](https://imgur.com/ttdW8nt.png)

Figure 5: Decoder architecture.
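
At the level of shapes, encoder-decoder attention differs from self-attention only in where the queries, keys, and values come from. The following is a rough sketch with toy sizes that omits the learned projections and multiple heads; the tensors `decoder_states` and `encoder_states` are random stand-ins, not outputs of the models in this notebook.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)
head_dim = 8                                   # toy size, for illustration only
decoder_states = torch.randn(1, 4, head_dim)   # 4 target tokens generated so far
encoder_states = torch.randn(1, 7, head_dim)   # 7 source tokens from the encoder

query = decoder_states           # queries come from the decoder
key = value = encoder_states     # keys and values come from the encoder

scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(head_dim)  # (1, 4, 7)
weights = F.softmax(scores, dim=-1)
context = torch.bmm(weights, value)                              # (1, 4, head_dim)
print(scores.shape, context.shape)
```

The score matrix is rectangular, with one row per target token and one column per source token, so the two sequences do not need to have the same length.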


Let’s take a look at the modifications we need to include masking in self-attention, and leave the implementation of the encoder-decoder attention layer as a homework problem. The trick with masked self-attention is to introduce a mask matrix with ones on and below the diagonal and zeros above:

In [None]:
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, seq_len, seq_len)
mask[0]

Here we’ve used PyTorch’s `tril` function to create the lower triangular matrix. Once we have this mask matrix, we can prevent each attention head from peeking at future tokens by using `torch.Tensor.masked_fill` to replace the scores at the masked (zero) positions with negative infinity:

In [None]:
def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)
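
The effect of the mask on the attention weights is easiest to see when all raw scores are equal. This is a small sketch with a toy sequence length of 4, chosen for illustration.

```python
import torch
import torch.nn.functional as F

seq_len = 4  # toy length, for illustration only
scores = torch.zeros(1, seq_len, seq_len)  # equal raw scores for every pair
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

# Positions above the diagonal get -inf, so softmax assigns them zero weight.
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)
print(weights[0])
```

Each row still sums to 1, but the weight is spread only over the current and earlier positions: the first token attends to itself alone, the second splits its weight evenly over two positions, and so on.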

In [None]:
class AttentionHeadMasked(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, x, e_k, e_v):
        '''
        x: input in decoder
        e_k: keys vector from encoder
        e_v: values vector from encoder
        '''
        batch_size, seq_len, _ = x.shape
        # Lower-triangular mask, created on the same device as the input.
        mask = torch.tril(torch.ones(batch_size, seq_len, seq_len, device=x.device))
        # Truncate e_k and e_v to the current position of the word.
        e_k = e_k[:, :seq_len, :]
        e_v = e_v[:, :seq_len, :]
        attn_outputs = scaled_dot_product_attention(
            self.q(x), self.k(e_k), self.v(e_v), mask)
        return attn_outputs

In [None]:
class MultiHeadAttentionMasked(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size # 768
        num_heads = config.num_attention_heads # 12
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHeadMasked(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, e_k, e_v):
        '''
        x: input in decoder
        e_k: keys vector from encoder
        e_v: values vector from encoder
        '''
        x = torch.cat([h(x, e_k, e_v) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

In [None]:
multihead_attn_msk = MultiHeadAttentionMasked(config)
input_embs = token_emb(inputs.input_ids)
e_k = e_v = encoder(inputs.input_ids)
# Assume that only the first 4 tokens of the sequence have been decoded so far.
attn_output_dec = multihead_attn_msk(input_embs[:,:4, :], e_k, e_v)
attn_output_dec.size()

In [None]:
class TransformerDecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttentionMasked(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x, e_k, e_v):
        '''
        x: input in decoder
        e_k: keys vector from encoder
        e_v: values vector from encoder
        '''
        # Apply layer normalization to the decoder input (used as the query)
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state, e_k, e_v)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [None]:
decoder_layer = TransformerDecoderLayer(config)
# Assume that only the first 4 tokens of the sequence have been decoded so far.
decoder_layer(input_embs[:,:4, :], e_k, e_v).size()

In [None]:
class TransformerDecoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerDecoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x, e_k, e_v):
        '''
        x: already-embedded input to the decoder (self.embeddings is not applied here)
        e_k: keys vector from encoder
        e_v: values vector from encoder
        '''
        for layer in self.layers:
            x = layer(x, e_k, e_v)
        return x

In [None]:
decoder = TransformerDecoder(config)
decoder(input_embs[:,:4, :], e_k, e_v).size()