# 5. Attention mechanisms and transformers

One major drawback of recurrent networks is that they have no built-in way to give some words in a sequence more
influence on a particular output than others. This causes sub-optimal performance with standard LSTM encoder-decoder
models on sequence-to-sequence tasks such as **Named Entity Recognition** and **Machine Translation**. In reality,
specific words in the input sequence often have more impact on particular outputs than others.

Consider a sequence-to-sequence model, such as one for machine translation. It is implemented with two recurrent
networks: one network, the encoder, collapses the input sequence into a hidden state, and another, the decoder, unrolls
this hidden state into the translated result. The problem with this approach is that the final state of the encoder
has a hard time remembering the beginning of the sentence, which leads to poor model quality on long sentences.

## 5.1 Attention mechanisms
Attention mechanisms provide a means of weighting the contextual impact of each input vector on each output prediction
of the RNN. They are implemented by creating shortcuts between the intermediate states of the input RNN and the
output RNN. In this manner, when generating the output symbol **y{t}**, we take into account all input hidden states
**h{i}**, with different weight coefficients **α{t,i}**.

![encoder-decoder-attention](https://raw.githubusercontent.com/pengfei99/PyTorchTuto/main/notebooks/img/encoder-decoder-attention.png)

You can find more details about the encoder-decoder model with an additive attention mechanism in this post: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
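
To make the weight coefficients concrete, here is a minimal sketch of one decoder step, written in plain Python so it runs without PyTorch. It assumes a simple dot-product score between the decoder state and each encoder state (additive attention would use a small feed-forward network instead); the names attend, encoder_states and decoder_state are illustrative and do not come from any library.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for a single decoder step.

    Assumption: the score is a plain dot product, chosen for brevity.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, h_i)) for h_i in encoder_states]
    weights = softmax(scores)  # one weight per input position
    dim = len(encoder_states[0])
    # context vector = weighted sum of all encoder hidden states
    context = [sum(w * h_i[k] for w, h_i in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context


# Three hypothetical encoder hidden states and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

The second encoder state is orthogonal to the decoder state, so it receives the smallest weight and contributes least to the context vector.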

### 5.1.1 Attention matrix
The coefficients **α{i,j}** form a matrix that represents how much each input word contributes to the generation of
each word in the output sequence. Below is an example of such a matrix:

![attention-matrix](https://raw.githubusercontent.com/pengfei99/PyTorchTuto/main/notebooks/img/attention-matrix.png)

Read this paper if you want to know more about the attention mechanism: https://arxiv.org/pdf/1409.0473.pdf


Attention mechanisms are responsible for much of the current or near-current state of the art in natural language
processing. Adding attention, however, greatly increases the number of model parameters, which led to scaling issues
with RNNs. A key constraint on scaling RNNs is that their recurrent nature makes it challenging to batch and
parallelize training: each element of a sequence must be processed in order, because every step depends on the
hidden state produced by the previous one.

The adoption of attention mechanisms, combined with this constraint, led to the creation of the now **state-of-the-art
Transformer models** that we know and use today, from **BERT** to **GPT-3**.



## 5.2 Transformer models

Instead of forwarding the context of each previous prediction into the next evaluation step, transformer models use
positional encodings and attention to capture the context of a given input within a provided window of text.
The image below shows how positional encodings with attention can capture context within a given window.

![transformer-animated-explanation](https://raw.githubusercontent.com/pengfei99/PyTorchTuto/main/notebooks/img/transformer-animated-explanation.gif)
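
As an illustration of what a positional encoding can look like, the sketch below builds the fixed sine and cosine table from the original Transformer paper. This is only one possible scheme and is an assumption here: BERT, for example, learns its position embeddings during training instead. The function name positional_encoding is illustrative.

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share one frequency: 1 / 10000^(2i / d_model)
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            # even dimensions use sine, odd dimensions use cosine
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Every position gets a distinct vector, which is added to the token embedding so that attention, which is otherwise order-agnostic, can tell positions apart.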

Since each input position is mapped independently to each output position, transformers can parallelize better than
RNNs, which enables much larger and more expressive language models. Each attention head can learn different
relationships between words, which improves downstream natural language processing tasks.
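
The independence between positions can be seen in a small sketch of scaled dot-product self-attention. It is a simplification under stated assumptions: a single head, and the input vectors are used directly as queries, keys and values, without the learned projection matrices a real transformer layer has. The function self_attention_row is a hypothetical name.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_row(i, x):
    """Output vector for position i; it reads only the inputs x, never other outputs."""
    d = len(x[0])
    # scaled dot product between position i and every position j
    scores = [sum(a * b for a, b in zip(x[i], x_j)) / math.sqrt(d) for x_j in x]
    weights = softmax(scores)
    return [sum(w * x_j[k] for w, x_j in zip(weights, x)) for k in range(d)]


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Computing the rows in any order gives the same result, so they could run in parallel.
in_order = [self_attention_row(i, x) for i in range(len(x))]
reverse_order = [self_attention_row(i, x) for i in reversed(range(len(x)))][::-1]
print(in_order == reverse_order)  # True
```

An RNN cannot be reordered this way, because each of its steps needs the hidden state of the step before it.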

### 5.2.1 BERT (Bidirectional Encoder Representations from Transformers)
BERT is a very large multi-layer transformer network with **12 layers for BERT-base and 24 for BERT-large**. The model
is first pre-trained on a large corpus of text data (Wikipedia and books) using unsupervised training (predicting
masked words in a sentence). During pre-training the model absorbs a significant level of language understanding,
which can then be leveraged on other datasets through fine-tuning. This process is called **transfer learning**.

![bert-language-modeling-masked-lm](https://raw.githubusercontent.com/pengfei99/PyTorchTuto/main/notebooks/img/bert-language-modeling-masked-lm.png)
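
The masked-word objective can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: the BERT paper selects about 15% of tokens and replaces only most of them with [MASK] (some are swapped for a random token or left unchanged), whereas the hypothetical mask_tokens helper below always uses [MASK].

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); a label is None where nothing is predicted."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)  # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # this position is ignored by the loss
    return masked, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the labels come from the text itself, no human annotation is needed, which is what makes pre-training on very large corpora practical.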

There are many variations of Transformer architectures, including:
- BERT
- DistilBERT
- BigBird
- GPT-3
- and others

They can all be fine-tuned. The **HuggingFace** (https://github.com/huggingface/) libraries provide a repository
for training many of these architectures with PyTorch.

### 5.2.2 Using BERT for text classification
Let's see how we can use a pre-trained BERT model for sequence classification. We will classify the original
AG News dataset.

1. Load the data
2. Prepare the data with the BERT tokenizer's encode function
3. Load the pre-trained BERT model
4. Train (fine-tune) the model
5. Evaluate the model



In [1]:
import torch
import torchtext
import numpy as np
from torchnlp import *
from torchtext.vocab import vocab
from collections import Counter, OrderedDict
import transformers


#### Step 1 load news dataset


In [2]:
# Step 1: load our dataset:

# download data
def load_dataset(storage_path):
    print("Loading dataset...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root=storage_path)
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset


path = "/tmp/pytorch/data"
train, test = load_dataset(path)

# build label class list
label_classes = ['World', 'Sports', 'Business', 'Sci/Tech']

Loading dataset...


#### Step 2 Prepare data loader with bert tokenizer
Because we will be using a pre-trained BERT model, we need to use its **specific tokenizer**. First, we will load the
tokenizer associated with the pre-trained BERT model.

The HuggingFace library contains a repository of pre-trained models, which you can use just by specifying their names
as arguments to the from_pretrained functions. All required binary files for the model are downloaded automatically.

However, at times you will need to load your own models, in which case you can specify the directory that
contains all relevant files, including the tokenizer parameters, the config.json file with model parameters,
the binary weights, etc.



In [3]:
# Step 2: prepare data loader with bert tokenizer

# To load the model from Internet repository using model name.
# Use this if you are running from your own copy of the notebooks
bert_model_name = 'bert-base-uncased'

# To load the model from the directory on disk. Use this for Microsoft Learn module, because we have
# prepared all required files for you.
# bert_model_path = './bert'

bert_tokenizer = transformers.BertTokenizer.from_pretrained(bert_model_name)

MAX_SEQ_LEN = 128
PAD_INDEX = bert_tokenizer.convert_tokens_to_ids(bert_tokenizer.pad_token)
UNK_INDEX = bert_tokenizer.convert_tokens_to_ids(bert_tokenizer.unk_token)

In [4]:
# The tokenizer object from transformers lib contains the encode function that can be directly used to encode text:
bert_tokenizer.encode('PyTorch is a great framework for NLP')

[101, 1052, 22123, 2953, 2818, 2003, 1037, 2307, 7705, 2005, 17953, 2361, 102]
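
The output above holds 13 ids for a seven-word sentence. In bert-base-uncased, 101 and 102 are the special [CLS] and [SEP] tokens that encode adds around the text, and words missing from the vocabulary are split into sub-word pieces. The sketch below shows the greedy longest-match-first splitting used by WordPiece-style tokenizers. The function wordpiece and the tiny toy_vocab are hypothetical, chosen only to mimic how words such as "pytorch" and "nlp" are likely split; this is not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"p", "##yt", "##or", "##ch", "great", "nl", "##p"}

print(wordpiece("pytorch", toy_vocab))  # ['p', '##yt', '##or', '##ch']
print(wordpiece("nlp", toy_vocab))      # ['nl', '##p']
print(wordpiece("great", toy_vocab))    # ['great']
```

Splitting rare words this way keeps the vocabulary small while still letting the model represent words it never saw as a whole.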

Then, let's create the iterators that we will use during training to access the data. Because BERT uses its own
encoding function, we need to define a padding function that uses the BERT tokenizer to transform text to tensors:

In [5]:
# padding function that uses the bert tokenizer
def text_to_tensor_bert(b):
    # b is a list of batch_size tuples:
    #   - first element of a tuple = label,
    #   - second = feature (text sequence)
    # encode each text, truncating to at most MAX_SEQ_LEN tokens
    v = [bert_tokenizer.encode(x[1], truncation=True, max_length=MAX_SEQ_LEN) for x in b]
    # compute max length of a sequence in this minibatch
    max_len = max(map(len, v))
    return (  # tuple of two tensors - labels and features
        torch.LongTensor([t[0] for t in b]),
        # right-pad every sequence with the tokenizer's pad id
        torch.stack([torch.nn.functional.pad(torch.tensor(t), (0, max_len - len(t)), mode='constant', value=PAD_INDEX) for t in v])
    )


train_loader = torch.utils.data.DataLoader(train, batch_size=8, collate_fn=text_to_tensor_bert, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=8, collate_fn=text_to_tensor_bert)
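
The collate function above returns only the padded token ids. HuggingFace BERT models also accept an optional attention_mask argument that marks which positions hold real tokens, so that padding can be ignored. The plain-Python sketch below shows what such a mask looks like next to the padded ids; pad_with_mask is a hypothetical helper that works on lists rather than tensors, and a pad id of 0 is assumed.

```python
def pad_with_mask(sequences, pad_id=0):
    """Right-pad id lists to equal length and build a matching 0/1 mask."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask


batch = [[101, 2003, 102], [101, 2003, 1037, 2307, 102]]
padded, mask = pad_with_mask(batch)
print(padded)  # [[101, 2003, 102, 0, 0], [101, 2003, 1037, 2307, 102]]
print(mask)    # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding only up to the longest sequence in each mini-batch, as the collate function does, keeps batches of short texts small.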

#### Step 3 load the pre-trained bert model
In this example, we will use a pre-trained BERT model called **bert-base-uncased**. Let's load the model using the
**BertForSequenceClassification** class. This ensures that our model already has the architecture required for
classification, including the final classifier. You will see a warning message stating that the weights of the final
classifier are not initialized and that the model requires training. That is perfectly okay, because it is
exactly what we are about to do!


In [6]:
def select_hardware_for_training(device_name):
    if device_name == 'cpu':
        return 'cpu'
    elif device_name == 'gpu':
        # fall back to cpu when no CUDA device is available
        return 'cuda' if torch.cuda.is_available() else 'cpu'
    else:
        print("Unknown device name, choose cpu as default device")
        return 'cpu'


device = select_hardware_for_training("gpu")

bert_model = transformers.BertForSequenceClassification.from_pretrained(bert_model_name, num_labels=4).to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

#### Step 4 Train the pre-trained bert model
Now we are ready to begin training! Because BERT is already pre-trained, we want to start with a rather small learning
rate in order not to destroy the initial weights.

All the hard work is done by the BertForSequenceClassification model. When we call the model on the training data, it
returns both the loss and the network output for the input mini-batch. We use loss for parameter optimization
(loss.backward() does the backward pass), and out for computing training accuracy by comparing the predicted
labels predict_labels (computed using argmax) with the expected labels.

In order to monitor the process, we accumulate loss and accuracy over several iterations, and print them every
report_freq training cycles.

This training will likely take quite a long time, so we limit the number of iterations.
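
The accuracy computation described above can be checked by hand on a tiny example. The sketch below uses plain Python lists instead of tensors; batch_accuracy and the logits values are made up for illustration. It also shows the label shift from the dataset's 1-4 range to the 0-3 range the classifier expects.

```python
def batch_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the expected label."""
    # argmax per row: index of the largest score
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)


raw_labels = [3, 1, 4]                # labels as stored in the dataset (1-4)
labels = [y - 1 for y in raw_labels]  # shifted to class indices 0-3

# Made-up network outputs: one row of four class scores per text.
logits = [
    [0.1, 0.2, 2.5, 0.0],  # argmax 2, label 2: correct
    [1.9, 0.3, 0.1, 0.2],  # argmax 0, label 0: correct
    [0.4, 1.1, 0.2, 0.9],  # argmax 1, label 3: wrong
]

print(batch_accuracy(logits, labels))  # 0.6666666666666666
```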

In [7]:
# define training loop
# increase iterations to train for longer!
def train_loop(model, dataloader, lr=2e-5, optimizer=None, iterations=500,
               report_freq=50):
    # the optimizer updates the model parameters to reduce the loss
    optimizer = optimizer or torch.optim.Adam(model.parameters(), lr=lr)

    model.train()
    # i counts all steps, c counts steps since the last report
    i, c = 0, 0
    # running sums of loss and accuracy since the last report
    acc_loss = 0.0
    acc_acc = 0.0

    for labels, texts in dataloader:
        labels = labels.to(device) - 1  # get labels in the range 0-3
        texts = texts.to(device)
        loss, out = model(texts, labels=labels)[:2]
        predict_labels = out.argmax(dim=1)
        acc = torch.mean((predict_labels == labels).type(torch.float32))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # store plain floats so the computation graph is not kept alive
        acc_loss += loss.item()
        acc_acc += acc.item()
        i += 1
        c += 1
        if i % report_freq == 0:
            print(f"Loss = {acc_loss / c}, Accuracy = {acc_acc / c}")
            c = 0
            acc_loss = 0.0
            acc_acc = 0.0
        # by default we only train on 500 mini-batches; if you increase iterations, the model
        # learns from more text sequences, but it will take more time.
        iterations -= 1
        if not iterations:
            break

In [None]:
train_loop(bert_model, train_loader)

You can see (especially if you increase the number of iterations and wait long enough) that BERT classification
gives us pretty good accuracy! That is because BERT already understands the structure of the language quite well,
and we only need to fine-tune the model together with its new final classifier. However, because BERT is a large
model, the whole training process takes a long time and requires serious computational power (a GPU, and preferably
more than one).

Note: In our example, we have been using one of the smallest pre-trained BERT models. There are larger models that
are likely to yield better results.

#### Step 5 Evaluate the model performance
Now we can evaluate the performance of our model on the test dataset. The evaluation loop is quite similar to the
training loop, but we should not forget to switch the model to evaluation mode by calling model.eval().

In [None]:
def eval_model(model, iterations=100):
    # eval mode disables dropout; no_grad below skips gradient tracking
    model.eval()
    acc = 0.0
    i = 0
    with torch.no_grad():
        for labels, texts in test_loader:
            labels = labels.to(device) - 1
            texts = texts.to(device)
            _, out = model(texts, labels=labels)[:2]
            labs = out.argmax(dim=1)
            acc += torch.mean((labs == labels).type(torch.float32)).item()
            i += 1
            # stop after the requested number of mini-batches
            if i >= iterations:
                break
    print(f"Final accuracy: {acc / i}")

In [None]:
eval_model(bert_model)
