# Attention mechanisms and transformers

One major drawback of recurrent networks is that all words in a sequence have the same impact on the result. This leads to sub-optimal performance of standard LSTM encoder-decoder models on sequence-to-sequence tasks such as Named Entity Recognition and Machine Translation. In reality, specific words in the input sequence often have more impact on the sequential outputs than others.

Consider a sequence-to-sequence model such as machine translation. It is implemented with two recurrent networks: one network (the **encoder**) collapses the input sequence into a hidden state, and the other (the **decoder**) unrolls this hidden state into the translated result. The problem with this approach is that the final state of the network struggles to remember the beginning of a sentence, which degrades the quality of the model on long sentences.

**Attention mechanisms** provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. They are implemented by creating shortcuts between the intermediate states of the input RNN and the output RNN. In this manner, when generating the output symbol $y_t$, we take into account all input hidden states $h_i$, with different weight coefficients $\alpha_{t,i}$.
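
As a sketch of the computation described above, the additive formulation of Bahdanau et al. can be written as follows. The symbols $s_{t-1}$ (previous decoder state), $e_{t,i}$ (alignment score), $c_t$ (context vector) and the learned parameters $W$, $U$, $v$ are not defined in this lesson and are introduced here only for illustration:

$$
e_{t,i} = v^\top \tanh\left(W s_{t-1} + U h_i\right), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k} \exp(e_{t,k})}, \qquad
c_t = \sum_{i} \alpha_{t,i}\, h_i
$$

The softmax in the middle makes the coefficients $\alpha_{t,i}$ positive and sum to 1 over the input positions $i$, so $c_t$ is a weighted average of the encoder hidden states that the decoder uses when producing $y_t$.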

The image below shows an encoder-decoder model with an additive attention layer:

![Image showing an encoder/decoder model with an additive attention layer](../../../../../translated_images/encoder-decoder-attention.7a726296894fb567aa2898c94b17b3289087f6705c11907df8301df9e5eeb3de.pcm.png)
*The encoder-decoder model with additive attention mechanism from [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf), taken from [this blog post](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)*

The attention matrix $\{\alpha_{i,j}\}$ represents the degree to which certain input words contribute to the generation of a given word in the output sequence. Below is an example of such a matrix:

![Image showing a sample alignment found by RNNsearch-50, taken from Bahdanau - arxiv.org](../../../../../translated_images/bahdanau-fig3.09ba2d37f202a6af11de6c82d2d197830ba5f4528d9ea430eb65fd3a75065973.pcm.png)

*Figure from [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) (Fig. 3)*
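
To make the attention matrix more concrete, here is a minimal sketch in plain Python. The words, alignment scores and hidden states are made-up toy values, and the helper names `softmax` and `context_vector` are hypothetical; a real model would learn the scores during training.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, states):
    """Weighted sum of the hidden states, one weight per input word."""
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]

# Hypothetical toy data: 3 input words, 2 output words
input_words = ["the", "black", "cat"]
output_words = ["chat", "noir"]

# scores[t][i]: made-up alignment score between output word t and input word i
scores = [
    [0.1, 0.3, 2.5],  # "chat" should align mostly with "cat"
    [0.2, 2.0, 0.4],  # "noir" should align mostly with "black"
]

# Each row of the attention matrix is a distribution over the input words
attention = [softmax(row) for row in scores]

# Toy 2-dimensional encoder hidden states, one per input word
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

for t, word in enumerate(output_words):
    best = max(range(len(input_words)), key=lambda i: attention[t][i])
    print(word, "->", input_words[best], [round(a, 2) for a in attention[t]])
    print("  context:", [round(c, 2) for c in context_vector(attention[t], hidden)])
```

Plotting `attention` as a heat map would give a small version of the alignment figure above: one row per output word, one column per input word.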

Attention mechanisms are responsible for much of the current or near-current state of the art in NLP. Adding attention, however, greatly increases the number of model parameters, which led to scaling issues with RNNs. A key constraint on scaling RNNs is that the recurrent nature of the models makes it challenging to batch and parallelize training. In an RNN, each element of a sequence must be processed in sequential order, so the computation cannot be easily parallelized.

The adoption of attention mechanisms, combined with this constraint, led to the creation of the now state-of-the-art Transformer models that we know and use today, such as BERT and GPT-3.

## Transformer models

Instead of forwarding the context of each previous prediction into the next evaluation step, **transformer models** use **positional encodings** and attention to capture the context of a given input within a provided window of text. The image below shows how positional encodings with attention can capture context within a given window.

![Animated GIF showing how the evaluations are performed in transformer models.](../../../../../lessons/5-NLP/18-Transformers/images/transformer-animated-explanation.gif) 
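
One common way to build positional encodings is the sinusoidal scheme from the original Transformer paper. The sketch below assumes that scheme; this lesson does not say which encoding is used, and BERT itself learns its position embeddings during training instead. The function name `positional_encoding` and the toy sizes are hypothetical.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)

# Adding the position vector to a (toy) token embedding makes the same word
# look different to the model at different positions.
token_embedding = [0.5] * 6
at_pos_0 = [e + p for e, p in zip(token_embedding, pe[0])]
at_pos_3 = [e + p for e, p in zip(token_embedding, pe[3])]
print(at_pos_0 != at_pos_3)
```

Because the position information is added to the input itself, the model does not need to process tokens one after another to know their order.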

Since each input position is mapped independently to each output position, transformers can parallelize better than RNNs, which enables much larger and more expressive language models. Each attention head can learn different relationships between words, which improves downstream Natural Language Processing tasks.
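
The independence of positions can be illustrated with a minimal sketch of scaled dot-product self-attention in plain Python. This is a simplification under stated assumptions: a real attention head first applies learned projection matrices to obtain queries, keys and values, whereas here the raw toy vectors are reused for all three. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:  # each iteration is independent of the others
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

# Toy input: 3 positions, 2-dimensional vectors
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)

# Processing one position on its own gives the same result as the full pass,
# which is why all positions can be computed in parallel.
assert self_attention([x[1]], x, x)[0] == out[1]
```

In practice the loop over queries is replaced by a single matrix multiplication on the GPU, and several heads with different projections run side by side.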

**BERT** (Bidirectional Encoder Representations from Transformers) is a very large multi-layer transformer network with 12 layers for *BERT-base* and 24 for *BERT-large*. The model is first pre-trained on a large corpus of text data (Wikipedia + books) using unsupervised training (predicting masked words in a sentence). During pre-training the model absorbs a significant level of language understanding, which can then be leveraged on other datasets through fine-tuning. This process is called **transfer learning**.

![picture from http://jalammar.github.io/illustrated-bert/](../../../../../translated_images/jalammarBERT-language-modeling-masked-lm.34f113ea5fec4362e39ee4381aab7cad06b5465a0b5f053a0f2aa05fbe14e746.pcm.png)
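
The masked-word pre-training objective can be sketched in a few lines of plain Python. The selection rate of 15% and the 80/10/10 split between `[MASK]`, a random token and the unchanged token follow the BERT paper rather than this lesson; the function name `mask_tokens`, the toy sentence and the higher demo rate of 0.3 are assumptions made for illustration.

```python
import random

MASK, IGNORE = "[MASK]", None

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels) for masked language modeling."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(IGNORE)  # position does not contribute to the loss
            corrupted.append(tok)
    return corrupted, labels

rng = random.Random(0)
sentence = "the quick brown fox jumps over the lazy dog".split()
# A higher rate than 15% is used here only so the short toy sentence is likely to change
corrupted, labels = mask_tokens(sentence, vocab=sentence, rng=rng, select_prob=0.3)
print(corrupted)
print(labels)
```

No human labels are needed: the training targets are the original words themselves, which is why this kind of pre-training scales to very large text corpora.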

There are many variations of Transformer architectures, including BERT, DistilBERT, BigBird, GPT-3 and more, that can be fine-tuned. The [HuggingFace package](https://github.com/huggingface/) provides a repository for training many of these architectures with PyTorch.

## Using BERT for text classification

Let's see how we can use a pre-trained BERT model to solve our traditional task: sequence classification. We will classify our original AG News dataset.

First, let's load the HuggingFace library and our dataset:


In [10]:
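import torch
import torchtext
from torchnlp import * # presumably the course's helper module; provides load_dataset
import transformers

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # used below by .to(device)

train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_len = len(vocab)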

Loading dataset...
Building vocab...


Since we will be using a pre-trained BERT model, we need to use a specific tokenizer. First, we will load the tokenizer associated with the pre-trained BERT model.

The HuggingFace library contains a repository of pre-trained models, which you can use just by specifying their names as arguments to `from_pretrained` functions. All binary files required by the model are downloaded automatically.

However, at times you will need to load your own models. In that case you can specify the directory that contains all the relevant files, including the parameters for the tokenizer, the `config.json` file with model parameters, binary weights, and so on.


In [11]:
# Option 1: load the model from the Internet repository using the model name.
# Use this if you are running from your own copy of the notebooks
bert_model = 'bert-base-uncased' 

# Option 2: load the model from a directory on disk. Use this for the Microsoft Learn module,
# because all required files have been prepared for you there.
# This assignment overrides option 1; comment it out if you have no local ./bert directory.
bert_model = './bert'

tokenizer = transformers.BertTokenizer.from_pretrained(bert_model)

MAX_SEQ_LEN = 128
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

The `tokenizer` object contains the `encode` function, which can be used directly to encode text:


In [15]:
tokenizer.encode('PyTorch is a great framework for NLP')

[101, 1052, 22123, 2953, 2818, 2003, 1037, 2307, 7705, 2005, 17953, 2361, 102]
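
In the output above, 101 and 102 are the IDs of the special `[CLS]` and `[SEP]` tokens that the BERT tokenizer adds around the text. The 7-word sentence yields 11 IDs in between because words missing from the vocabulary are split into smaller sub-word pieces. The sketch below illustrates the greedy longest-match idea behind this WordPiece splitting in plain Python. The toy vocabulary and the helper names `wordpiece` and `encode_tokens` are hypothetical, and the real tokenizer does more (for example punctuation handling and text normalization).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-word matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real one has about 30,000 entries
toy_vocab = {"is", "a", "great", "p", "##yt", "##or", "##ch", "nl", "##p"}

def encode_tokens(text, vocab):
    """Lowercase, split on spaces, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

tokens = encode_tokens("PyTorch is a great NLP", toy_vocab)
print(tokens)
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(...)` can be applied to the list of IDs to see which pieces were actually produced.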

Then, let's create the iterators that we will use during training to access the data. Because BERT uses its own encoding function, we need to define a padding function similar to the `padify` function we defined earlier:


In [4]:
def pad_bert(b):
    # b is the list of tuples of length batch_size
    #   - first element of a tuple = label, 
    #   - second = feature (text sequence)
    # build vectorized sequence
    v = [tokenizer.encode(x[1]) for x in b]
    # compute max length of a sequence in this minibatch
    l = max(map(len,v))
    return ( # tuple of two tensors - labels and features
        torch.LongTensor([t[0] for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=PAD_INDEX) for t in v]) # pad on the right with the tokenizer's padding id
    )

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, collate_fn=pad_bert, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=8, collate_fn=pad_bert)
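
The `pad_bert` function above pads shorter sequences but does not tell the model which positions are padding. HuggingFace BERT models accept an optional `attention_mask` argument for that purpose. The following is a minimal sketch in plain Python of how such a mask could be built alongside the padded IDs; the function name `pad_with_mask` and the toy batch are hypothetical, and this is not part of the original notebook.

```python
def pad_with_mask(sequences, pad_id=0):
    """Pad token-id lists to equal length and build the matching mask."""
    longest = max(len(s) for s in sequences)
    ids = [s + [pad_id] * (longest - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore
    mask = [[1] * len(s) + [0] * (longest - len(s)) for s in sequences]
    return ids, mask

# Toy batch of two already-encoded sequences of different length
batch = [[101, 2003, 102], [101, 2307, 7705, 2005, 102]]
ids, mask = pad_with_mask(batch)
print(ids)
print(mask)
```

Converted to tensors, the mask could be returned from the collate function as a third element and passed to the model as `model(texts, attention_mask=mask, labels=labels)`.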

In our case, we will use the pre-trained BERT model called `bert-base-uncased`. Let's load the model using the `BertForSequenceClassification` class. This ensures that the model already has the architecture required for classification, including the final classifier. You will see a warning message stating that the weights of the final classifier are not initialized and that the model needs to be trained. That is perfectly fine, because training it is exactly what we are about to do!


In [9]:
model = transformers.BertForSequenceClassification.from_pretrained(bert_model,num_labels=4).to(device)

Some weights of the model checkpoint at ./bert were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./bert and

Now we are ready to begin training! Because BERT is already pre-trained, we want to start with a rather small learning rate in order not to destroy the initial weights.

All the hard work is done by the `BertForSequenceClassification` model. When we call the model on the training data, it returns both the loss and the network output for the input minibatch. We use the loss for parameter optimization (`loss.backward()` performs the backward pass), and `out` for computing the training accuracy by comparing the predicted labels `labs` (computed using `argmax`) with the expected `labels`.
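
To show what the loss and accuracy mean numerically, here is a minimal sketch in plain Python. It assumes the cross-entropy loss that `BertForSequenceClassification` applies for single-label classification. The names `out`, `labs` and `labels` mirror the training loop, but the logits and labels are made-up values, and `cross_entropy` is a hypothetical helper.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the correct class for one sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

# Hypothetical network outputs (logits) for 3 samples and 4 classes
out = [[2.0, 0.1, 0.3, -1.0],
       [0.2, 0.1, 3.0, 0.0],
       [0.5, 0.4, 0.1, 0.3]]
labels = [0, 2, 3]

# Mean loss over the minibatch
loss = sum(cross_entropy(z, y) for z, y in zip(out, labels)) / len(labels)

# argmax over each row gives the predicted class
labs = [max(range(len(z)), key=lambda k: z[k]) for z in out]

# Accuracy is the fraction of predictions that match the expected labels
acc = sum(p == y for p, y in zip(labs, labels)) / len(labels)
print(f"Loss = {loss:.3f}, Accuracy = {acc:.3f}")
```

The third sample is misclassified because its largest logit belongs to class 0 rather than class 3, so it lowers the accuracy and contributes a larger term to the loss.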

To monitor the process, we accumulate the loss and accuracy over several iterations and print them every `report_freq` training cycles.

This training will likely take quite a long time, so we limit the number of iterations.


In [6]:
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

report_freq = 50
iterations = 500 # make this larger to train for longer time!

model.train()

i,c = 0,0
acc_loss = 0
acc_acc = 0

for labels,texts in train_loader:
    labels = labels.to(device)-1 # get labels in the range 0-3         
    texts = texts.to(device)
    loss, out = model(texts, labels=labels)[:2]
    labs = out.argmax(dim=1)
    acc = torch.mean((labs==labels).type(torch.float32))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    acc_loss += loss.item() # accumulate plain floats so the autograd graph is not kept alive
    acc_acc += acc.item()
    i+=1
    c+=1
    if i%report_freq==0:
        print(f"Loss = {acc_loss/c}, Accuracy = {acc_acc/c}")
        c = 0
        acc_loss = 0
        acc_acc = 0
    iterations-=1
    if not iterations:
        break

Loss = 1.1254194641113282, Accuracy = 0.585
Loss = 0.6194715118408203, Accuracy = 0.83
Loss = 0.46665248870849607, Accuracy = 0.8475
Loss = 0.4309701919555664, Accuracy = 0.8575
Loss = 0.35427074432373046, Accuracy = 0.8825
Loss = 0.3306886291503906, Accuracy = 0.8975
Loss = 0.30340143203735354, Accuracy = 0.8975
Loss = 0.26139299392700194, Accuracy = 0.915
Loss = 0.26708646774291994, Accuracy = 0.9225
Loss = 0.3667240524291992, Accuracy = 0.8675


You can see (especially if you increase the number of iterations and wait long enough) that BERT classification gives us pretty good accuracy! That is because BERT already understands the structure of the language quite well, so only a short fine-tuning run, which also trains the new final classifier, is needed. However, because BERT is a large model, the whole training process takes a long time and requires serious computational power (a GPU, and preferably more than one).

> **Note:** In our example, we have been using one of the smallest pre-trained BERT models. There are larger models that are likely to yield better results.


## Evaluating the model performance

Now we can evaluate the performance of our model on the test dataset. The evaluation loop is quite similar to the training loop, but we must not forget to switch the model to evaluation mode by calling `model.eval()`.


In [10]:
model.eval()
iterations = 100
acc = 0
i = 0
with torch.no_grad(): # gradients are not needed for evaluation
    for labels,texts in test_loader:
        labels = labels.to(device)-1      
        texts = texts.to(device)
        _, out = model(texts, labels=labels)[:2]
        labs = out.argmax(dim=1)
        acc += torch.mean((labs==labels).type(torch.float32))
        i+=1
        if i>iterations: break # stops after iterations+1 minibatches
        
print(f"Final accuracy: {acc.item()/i}")

Final accuracy: 0.9047029702970297
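
A single accuracy number hides which classes get confused with each other. As a possible next step, the predictions could be collected into a confusion matrix. The sketch below is plain Python with made-up labels and predictions (not real model output); the function name `confusion_matrix` and the list `class_names`, which stands in for the four AG News categories, are assumptions for illustration.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """counts[t][p] = number of samples of class t predicted as class p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    return counts

# Hypothetical labels and predictions in the range 0-3
class_names = ["World", "Sports", "Business", "Sci/Tech"]
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 2, 1, 1, 2, 3, 3, 3]

cm = confusion_matrix(y_true, y_pred, len(class_names))
for k, (name, row) in enumerate(zip(class_names, cm)):
    recall = row[k] / sum(row)  # share of class k that was predicted correctly
    print(f"{name:10s} {row}  recall = {recall:.2f}")
```

In the evaluation loop, `labels` and `labs` from each minibatch could be appended to two lists (after moving them to the CPU) and passed to such a function once the loop finishes.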


## Takeaway

In this unit, we have seen how easy it is to take a pre-trained language model from the **transformers** library and adapt it to our text classification task. Similarly, BERT models can be used for entity extraction, question answering, and other NLP tasks.

Transformer models represent the current state of the art in NLP, and in most cases they should be the first solution you experiment with when implementing custom NLP solutions. However, understanding the basic underlying principles of recurrent neural networks discussed in this module is extremely important if you want to build advanced neural models.


---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
