# Attention mechanisms and transformers

One major drawback of recurrent networks is that all words in a sequence have the same impact on the result. This leads to sub-optimal performance with standard LSTM encoder-decoder models on sequence-to-sequence tasks, such as Named Entity Recognition and Machine Translation. In reality, specific words in the input sequence often have more impact on the sequential outputs than others.

Consider a sequence-to-sequence model, such as machine translation. It is implemented by two recurrent networks: one network (the **encoder**) collapses the input sequence into a hidden state, and the other, the **decoder**, unrolls this hidden state into the translated result. The problem with this approach is that the final state of the network struggles to remember the beginning of the sentence, which degrades model quality on long sentences.

**Attention mechanisms** provide a way to weight the contextual impact of each input vector on each output prediction of the RNN. This is implemented by creating shortcuts between the intermediate states of the input RNN and the output RNN. In this manner, when generating output symbol $y_t$, we take into account all input hidden states $h_i$, with different weight coefficients $\alpha_{t,i}$.
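
The weighting described above can be illustrated with a small, framework-free sketch. It computes the coefficients $\alpha_{t,i}$ for one output step and the resulting context vector. Note the assumptions: the scoring function here is a plain dot product chosen for brevity (Bahdanau et al. use a small additive network instead), and the names `attention_context`, `encoder_states` and `decoder_state` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for a single output step t."""
    # ASSUMPTION: dot-product scoring, used here only as a simple stand-in
    scores = [sum(d * h for d, h in zip(decoder_state, h_i)) for h_i in encoder_states]
    weights = softmax(scores)  # these play the role of alpha_{t,i}
    dim = len(encoder_states[0])
    # The context vector is the weighted sum of all encoder hidden states
    context = [sum(w * h_i[k] for w, h_i in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states h_1..h_3
decoder_state = [2.0, 0.0]                              # toy decoder state at step t
weights, context = attention_context(decoder_state, encoder_states)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The encoder states that align best with the decoder state receive the largest weights, so they dominate the context vector used to predict $y_t$.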

![Image showing an encoder/decoder model with an additive attention layer](../../../../../translated_images/encoder-decoder-attention.7a726296894fb567aa2898c94b17b3289087f6705c11907df8301df9e5eeb3de.pcm.png)
*The encoder-decoder model with an additive attention mechanism from [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf), as shown in [this blog post](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)*

The attention matrix $\{\alpha_{i,j}\}$ represents the degree to which certain input words contribute to the generation of a given word in the output sequence. Below is an example of such a matrix:

![Image showing a sample alignment found by RNNsearch-50, taken from Bahdanau - arxiv.org](../../../../../translated_images/bahdanau-fig3.09ba2d37f202a6af11de6c82d2d197830ba5f4528d9ea430eb65fd3a75065973.pcm.png)

*Figure taken from [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) (Fig. 3)*

Attention mechanisms are responsible for much of the current or near-current state of the art in natural language processing. Adding attention, however, greatly increases the number of model parameters, which led to scaling issues with RNNs. A key constraint on scaling RNNs is that the recurrent nature of the models makes it challenging to batch and parallelize training. In an RNN, each element of a sequence must be processed in sequential order, which means it cannot be easily parallelized.

The adoption of attention mechanisms, combined with this constraint, led to the creation of the state-of-the-art Transformer models that we know and use today, such as BERT and GPT-3.

## Transformer models

Instead of forwarding the context of each previous prediction into the next evaluation step, **transformer models** use **positional encodings** and **attention** to capture the context of a given input within a provided window of text. The image below shows how positional encodings with attention can capture context within a given window.

![Animated GIF showing how the evaluations are performed in transformer models.](../../../../../lessons/5-NLP/18-Transformers/images/transformer-animated-explanation.gif) 

Because each input position is mapped independently to each output position, transformers parallelize better than RNNs, which enables much larger and more expressive language models. Each attention head can learn different relationships between words, which improves downstream natural language processing tasks.
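
To make the independence of positions concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: queries, keys and values are all the raw input vectors (a real transformer applies learned projection matrices and several heads), and the function name `self_attention` is hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with queries = keys = values = x."""
    d = len(x[0])
    outputs = []
    # Each loop iteration depends only on x, never on a previous output row,
    # so all rows could be computed in parallel (unlike the steps of an RNN).
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
out = self_attention(tokens)
out_reversed = self_attention(tokens[::-1])
```

Reversing the input order simply reverses the output rows: attention by itself has no notion of word order, which is why positional information has to be added to the token vectors.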

## Building a Simple Transformer Model

Keras does not contain a built-in Transformer layer, but we can build our own. As before, we will focus on text classification of the AG News dataset, but it is worth mentioning that Transformer models show their best results on more difficult NLP tasks.


In [1]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

# Request the splits by name so the result does not depend on dictionary ordering
ds_train, ds_test = tfds.load('ag_news_subset', split=['train', 'test'])

def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

New Keras layers should subclass the `Layer` class and implement the `call` method. Let's start with the **Positional Embedding** layer. We will use [some code from the official Keras documentation](https://keras.io/examples/nlp/text_classification_with_transformer/). We will assume that all input sequences are padded to length `maxlen`.


In [2]:
class TokenAndPositionEmbedding(keras.layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = keras.layers.Embedding(input_dim=maxlen, output_dim=embed_dim)
        self.maxlen = maxlen

    def call(self, x):
        maxlen = self.maxlen
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x+positions

This layer consists of two `Embedding` layers: one for embedding tokens (as discussed before) and one for token positions. Token positions are created as a sequence of natural numbers from 0 to `maxlen - 1` using `tf.range`, and then passed through the embedding layer. The two resulting embedding vectors are added together, producing a positionally-embedded representation of the input with shape `maxlen`$\times$`embed_dim`.
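
The effect of adding the two embeddings can be seen in a tiny framework-free sketch. Everything here is an illustrative assumption: the lookup tables are filled with random numbers instead of learned weights, the sizes are much smaller than in the lesson, and `make_table` and `embed` are hypothetical helper names.

```python
import random

def make_table(rows, dim, seed):
    """Stand-in for a learned embedding matrix: one random vector per row."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

maxlen, vocab_size, embed_dim = 4, 10, 3
token_table = make_table(vocab_size, embed_dim, seed=0)  # one vector per token id
pos_table = make_table(maxlen, embed_dim, seed=1)        # one vector per position

def embed(token_ids):
    """Look up the token vector and the position vector, then add them."""
    return [
        [t + p for t, p in zip(token_table[tok], pos_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

embedded = embed([7, 7, 2, 0])  # token 7 appears at positions 0 and 1
```

The result has `maxlen` rows of `embed_dim` values each, and the two occurrences of token 7 get different vectors because their position vectors differ.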

<img src="../../../../../translated_images/pos-embedding.e41ce9b6cf6078afd28da02f27e33ac7026ed4c156491df7ad9aa96be7c194bb.pcm.png" width="40%"/>

Now, let's implement the transformer block. It takes the output of the previously defined embedding layer:


In [3]:
class TransformerBlock(keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim, name='attn')
        self.ffn = keras.Sequential(
            [keras.layers.Dense(ff_dim, activation="relu"), keras.layers.Dense(embed_dim),]
        )
        self.layernorm1 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = keras.layers.Dropout(rate)
        self.dropout2 = keras.layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

The transformer applies `MultiHeadAttention` to the positionally-encoded input to produce an attention vector of dimension `maxlen`$\times$`embed_dim`, which is then added to the input and normalized using `LayerNormalization`.

> **Note**: `LayerNormalization` is similar to `BatchNormalization`, discussed in the *Computer Vision* part of this learning path, but it normalizes the outputs of the previous layer for each training sample independently, so that they have approximately zero mean and unit variance.

The output of this layer is then passed through a `Dense` network (in our case, a two-layer perceptron), and the result is added to the final output (which undergoes normalization again).
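
The "add, then normalize" step used twice in the block can be sketched without any framework. This is a simplified illustration, assuming a single token vector and leaving out the learned scale and offset parameters that the Keras layer also has; `layer_norm` and `add_and_norm` are hypothetical names, and the numbers are made up.

```python
import math

def layer_norm(x, epsilon=1e-6):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + epsilon) for v in x]

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_output)])

token_vector = [0.5, -1.0, 2.0, 4.5]
attn_output = [0.1, 0.3, -0.2, 0.4]  # made-up stand-in for the attention output
normalized = add_and_norm(token_vector, attn_output)
```

The same pattern is applied a second time around the feed-forward network, with `out1` in place of the input and `ffn_output` in place of the attention output.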

<img src="../../../../../translated_images/transformer-layer.905e14747ca4e7d5cf1409e8bf8944c9b1d6e4f5ce3ab167918af65c4904d727.pcm.png" width="30%" />

Now we are ready to define the complete transformer model:


In [4]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer
maxlen = 256
vocab_size = 20000

model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_sequence_length=maxlen, input_shape=(1,)),
    TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim),
    TransformerBlock(embed_dim, num_heads, ff_dim),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(20, activation="relu"),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(4, activation="softmax")
])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, 256)               0         
_________________________________________________________________
token_and_position_embedding (None, 256, 32)           648192    
_________________________________________________________________
transformer_block (Transform (None, 256, 32)           10656     
_________________________________________________________________
global_average_pooling1d (Gl (None, 32)                0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 20)                660       
_________________________________________________________________
dropout_3 (Dropout)          (None, 20)               
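
The parameter counts reported in the summary can be checked by hand. The sketch below is a plain arithmetic exercise; the breakdown of `MultiHeadAttention` into query, key, value and output projections with biases is an assumption about how the layer is organized internally, made because it reproduces the reported numbers.

```python
embed_dim, num_heads, ff_dim = 32, 2, 32
maxlen, vocab_size = 256, 20000
key_dim = embed_dim  # the value passed to MultiHeadAttention in TransformerBlock

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Token embedding table plus position embedding table
embedding = vocab_size * embed_dim + maxlen * embed_dim

# ASSUMPTION: three input projections (query, key, value) and one output projection
qkv = 3 * dense_params(embed_dim, num_heads * key_dim)
attn_out = dense_params(num_heads * key_dim, embed_dim)
ffn = dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim)
layer_norms = 2 * (2 * embed_dim)  # scale and offset for each of the two layers
transformer_block = qkv + attn_out + ffn + layer_norms

classifier_hidden = dense_params(embed_dim, 20)  # the Dense(20) layer

print(embedding, transformer_block, classifier_hidden)  # prints 648192 10656 660
```

Almost all parameters sit in the embedding tables, which is typical for small models with a large vocabulary.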

In [5]:
print('Training tokenizer')
model.layers[0].adapt(ds_train.map(extract_text))
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128))

Training tokenizer


<tensorflow.python.keras.callbacks.History at 0x7f9c2427a0d0>

## BERT Transformer Models

**BERT** (Bidirectional Encoder Representations from Transformers) is a very large multi-layer transformer network with 12 layers for *BERT-base* and 24 for *BERT-large*. The model is first pre-trained on a large corpus of text data (Wikipedia + books) using unsupervised training (predicting masked words in a sentence). During pre-training the model absorbs a significant level of language understanding, which can then be leveraged on other datasets through fine-tuning. This process is called **transfer learning**.
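
The masked-word objective mentioned above can be sketched in a few lines. This is a simplified illustration and not BERT's actual data pipeline: it always substitutes `[MASK]`, whereas the BERT paper masks about 15% of tokens and sometimes substitutes a random or unchanged token instead. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens; the hidden originals become the prediction targets."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append('[MASK]')  # the model sees the placeholder...
            labels.append(tok)       # ...and must predict the original token
        else:
            masked.append(tok)
            labels.append(None)      # no loss is computed at this position
    return masked, labels

sentence = 'the quick brown fox jumps over the lazy dog'.split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the targets come from the text itself, no manual labeling is needed, which is what makes pre-training on very large corpora practical.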

![picture from http://jalammar.github.io/illustrated-bert/](../../../../../translated_images/jalammarBERT-language-modeling-masked-lm.34f113ea5fec4362e39ee4381aab7cad06b5465a0b5f053a0f2aa05fbe14e746.pcm.png)

There are many variations of Transformer architectures, including BERT, DistilBERT, BigBird, GPT-3, and more, that can be fine-tuned.

Let's see how we can use a pre-trained BERT model to solve our traditional sequence classification problem. We will borrow the idea and some code from the [official documentation](https://www.tensorflow.org/text/tutorials/classify_text_with_bert).

To load pre-trained models, we will use **TensorFlow Hub**. First, let's load the BERT-specific vectorizer:


In [1]:
import tensorflow_text  # registers the ops the BERT preprocessing layer needs (pip install tensorflow-text)
import tensorflow_hub as hub
vectorizer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')


In [7]:
vectorizer(['I love transformers'])

{'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_word_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[  101,  1045,  2293, 19081,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0, 

It is important that you use the same vectorizer that the original network was trained with. Also, the BERT vectorizer returns three components:

* `input_word_ids`, which is a sequence of token numbers for the input sentence
* `input_mask`, which shows which part of the sequence contains actual input and which part is padding. It is similar to the mask produced by a `Masking` layer
* `input_type_ids`, which is used for language modeling tasks and lets you specify two input sentences in one sequence
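
How the three components relate to each other can be shown with a small hand-written packing function. This is only an illustrative sketch and not the TensorFlow Hub preprocessor: `pack_inputs` is a hypothetical helper, the sequence length is shortened to 12 for readability (the real layer above uses 128), and it assumes the inputs fit without truncation. The token ids are the ones shown in the output above, where 101 and 102 mark the start and the end of a sentence.

```python
def pack_inputs(first_ids, second_ids=None, seq_length=12, cls_id=101, sep_id=102, pad_id=0):
    """Build the three BERT input fields for one or two sentences."""
    word_ids = [cls_id] + first_ids + [sep_id]
    type_ids = [0] * len(word_ids)  # segment 0: first sentence
    if second_ids:
        word_ids += second_ids + [sep_id]
        type_ids += [1] * (len(second_ids) + 1)  # segment 1: second sentence
    n_real = len(word_ids)
    padding = seq_length - n_real  # assumes the input is not longer than seq_length
    return {
        'input_word_ids': word_ids + [pad_id] * padding,
        'input_mask': [1] * n_real + [0] * padding,  # 1 = real token, 0 = padding
        'input_type_ids': type_ids + [0] * padding,
    }

single = pack_inputs([1045, 2293, 19081])  # ids taken from the output above
pair = pack_inputs([1045, 2293], [19081])  # the same ids split into two segments
print(single)
print(pair)
```

For a single sentence `input_type_ids` is all zeros, as in the output above; it only becomes informative when two sentences share one sequence.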

After that, we can create the BERT feature extractor:


In [8]:
bert = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1')

In [9]:
z = bert(vectorizer(['I love transformers']))
for i,x in z.items():
    print(f"{i} -> { len(x) if isinstance(x, list) else x.shape }")

pooled_output -> (1, 128)
encoder_outputs -> 4
sequence_output -> (1, 128, 128)
default -> (1, 128)


So, the BERT layer returns a number of useful results:

* `pooled_output` represents the whole input sequence as a single vector. In BERT it is computed from the final hidden state of the first (`[CLS]`) token, passed through an additional dense layer, rather than by averaging all tokens. You can view it as an intelligent semantic embedding of the whole sequence. It plays a role similar to the output of the `GlobalAveragePooling1D` layer in our previous model.

* `sequence_output` is the output of the last transformer layer (it corresponds to the output of `TransformerBlock` in our model above).

* `encoder_outputs` are the outputs of all transformer layers. Since we have loaded a 4-layer BERT model (as you can probably guess from the name, which contains `L-4_H`), it holds 4 tensors. The last one is the same as `sequence_output`.

Now we will define the end-to-end classification model. We will use the *functional model definition*, in which we define the model input and then provide a series of expressions to calculate its output. We will also make the BERT model weights non-trainable and train only the final classifier:


In [10]:
inp = keras.Input(shape=(),dtype=tf.string)
x = vectorizer(inp)
x = bert(x)
x = keras.layers.Dropout(0.1)(x['pooled_output'])
out = keras.layers.Dense(4,activation='softmax')(x)
model = keras.models.Model(inp,out)
bert.trainable = False
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None,)]            0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        {'input_type_ids': ( 0           input_1[0][0]                    
__________________________________________________________________________________________________
keras_layer_1 (KerasLayer)      {'pooled_output': (N 4782465     keras_layer[0][0]                
                                                                 keras_layer[0][1]                
                                                                 keras_layer[0][2]                
______________________________________________________________________________________________

In [11]:
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128))



<tensorflow.python.keras.callbacks.History at 0x7f9bb1e36d00>

Even though there are few trainable parameters, the process is pretty slow, because the BERT feature extractor is computationally heavy. It looks like we were not able to achieve reasonable accuracy, either due to lack of training or lack of model parameters.

Let's try to unfreeze the BERT weights and train them as well. This requires a very small learning rate, and also a more careful training strategy with **warmup**, using the **AdamW** optimizer. We will use the `tf-models-official` package to create the optimizer:


In [12]:
from official.nlp import optimization 
bert.trainable=True
model.summary()
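epochs = 3
batch_size = 128
# The schedule counts optimizer steps (batches), not individual samples
steps_per_epoch = (len(ds_train) + batch_size - 1) // batch_size
num_train_steps = epochs * steps_per_epoch
opt = optimization.create_optimizer(
    init_lr=3e-5,
    num_train_steps=num_train_steps,
    num_warmup_steps=int(0.1 * num_train_steps),
    optimizer_type='adamw')

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer=opt)
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size),epochs=epochs)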

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None,)]            0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        {'input_type_ids': ( 0           input_1[0][0]                    
__________________________________________________________________________________________________
keras_layer_1 (KerasLayer)      {'pooled_output': (N 4782465     keras_layer[0][0]                
                                                                 keras_layer[0][1]                
                                                                 keras_layer[0][2]                
______________________________________________________________________________________________

<tensorflow.python.keras.callbacks.History at 0x7f9bb0bd0070>

As you can see, the training goes quite slowly, but you may want to experiment and train the model for a few epochs (5-10) to see whether you can get a better result than with the approaches we have used before.
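
The idea behind **warmup** can be illustrated with a small stand-alone schedule function. This is a sketch of one common shape (linear warmup followed by linear decay to zero), written under the assumption that it approximates what the optimizer helper does; the exact curve used by the library may differ, and `warmup_linear_decay` is a hypothetical name.

```python
def warmup_linear_decay(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up from 0 so the first updates do not disturb pre-trained weights
        return init_lr * step / num_warmup_steps
    # After warmup, decay linearly and reach 0 at the last training step
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / (num_train_steps - num_warmup_steps)

init_lr, total, warmup = 3e-5, 1000, 100
schedule = [warmup_linear_decay(s, init_lr, total, warmup) for s in range(total + 1)]
print(schedule[0], schedule[50], schedule[100], schedule[1000])
```

The learning rate starts at zero, peaks at `init_lr` when warmup ends, and then shrinks, so the large pre-trained network is only ever adjusted in small steps.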

## Huggingface Transformers Library

Another very common (and somewhat simpler) way to use Transformer models is the [HuggingFace package](https://github.com/huggingface/), which provides simple building blocks for different NLP tasks. It is available for both TensorFlow and PyTorch, another very popular neural network framework.

> **Note**: If you are not interested in seeing how the Transformers library works, you may skip to the end of this notebook, because you will not see anything substantially different from what we have done above. We will be repeating the same steps of training a BERT model using a different library and a substantially larger model. The process therefore involves some rather long training, so you may want to just look through the code.

Let's see how our problem can be solved using [Huggingface Transformers](http://huggingface.co).


The first thing we need to do is choose the model we will be using. In addition to some built-in models, Huggingface hosts an [online model repository](https://huggingface.co/models), where you can find many more pre-trained models contributed by the community. All of those models can be loaded and used just by providing a model name. All the binary files the model requires are downloaded automatically.

At certain times you may need to load your own models. In that case, you can specify the directory that contains all relevant files, including the parameters for the tokenizer, the `config.json` file with model parameters, binary weights, and so on.

From the model name, we can instantiate both the model and the tokenizer. Let's start with the tokenizer:


In [2]:
import transformers

# To load the model from Internet repository using model name. 
# Use this if you are running from your own copy of the notebooks
bert_model = 'bert-base-uncased' 

# To load the model from the directory on disk. Use this for Microsoft Learn module, because we have
# prepared all required files for you.
#bert_model = './bert'

tokenizer = transformers.BertTokenizer.from_pretrained(bert_model)

MAX_SEQ_LEN = 128
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

The `tokenizer` object contains the `encode` function, which can be used directly to encode text:


In [3]:
tokenizer.encode('Tensorflow is a great framework for NLP')

[101, 23435, 12314, 2003, 1037, 2307, 7705, 2005, 17953, 2361, 102]
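
The output above has 11 ids for a 7-word sentence: two are special start and end markers, and some words are split into several sub-word pieces. The sketch below shows the general greedy longest-match idea behind this kind of splitting. It is not the library's implementation, and the vocabulary `toy_vocab` is a hypothetical one invented for this example, so the real tokenizer's pieces may differ.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {'tensor', '##flow', 'is', 'a', 'great', 'framework', 'for', 'nl', '##p'}
sentence = 'Tensorflow is a great framework for NLP'
tokens = ['[CLS]'] + [p for w in sentence.lower().split() for p in wordpiece(w, toy_vocab)] + ['[SEP]']
print(tokens)
```

Splitting rare words into pieces keeps the vocabulary small while still letting the model represent words it has never seen as a whole.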

We can also use the tokenizer to encode a sequence in a form suitable for passing to the model, including the `input_ids`, `token_type_ids`, and `attention_mask` fields. We can also request TensorFlow tensors by providing the `return_tensors='tf'` argument:


In [4]:
tokenizer(['Hello, there'],return_tensors='tf')

{'input_ids': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[ 101, 7592, 1010, 2045,  102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[1, 1, 1, 1, 1]], dtype=int32)>}

In our case, we will use the pre-trained BERT model called `bert-base-uncased`. *Uncased* means that the model is case-insensitive.

When training the model, we need to provide a tokenized sequence as input, so we will design a data processing pipeline. Since `tokenizer.encode` is a Python function, we will use the same approach as in the last unit and call it through `py_function`:


In [31]:
def process(x):
    return tokenizer.encode(x.numpy().decode('utf-8'),return_tensors='tf',padding='max_length',max_length=MAX_SEQ_LEN,truncation=True)[0]

def process_fn(x):
    s = x['title']+' '+x['description']
    e = tf.py_function(process,inp=[s],Tout=(tf.int32))
    e.set_shape(MAX_SEQ_LEN)
    return e,x['label']

Now we can load the actual model using the `BertForSequenceClassification` class (its TensorFlow variant, `TFBertForSequenceClassification`, is used below). This ensures that the model already has the architecture required for classification, including the final classifier. You will see a warning message stating that the weights of the final classifier are not initialized and that the model needs to be trained. That is perfectly okay, because it is exactly what we are about to do!


In [32]:
model = transformers.TFBertForSequenceClassification.from_pretrained(bert_model,num_labels=4,output_attentions=False)

In [33]:
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_75 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  3076      
Total params: 109,485,316
Trainable params: 109,485,316
Non-trainable params: 0
_________________________________________________________________


As you can see from `summary()`, the model contains almost 110 million parameters! Presumably, if we want a simple classification task on a relatively small dataset, we do not want to train the BERT base layer:


In [34]:
model.layers[0].trainable = False
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_75 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  3076      
Total params: 109,485,316
Trainable params: 3,076
Non-trainable params: 109,482,240
_________________________________________________________________


Now we are ready to begin training!

> **Note**: Training a full-scale BERT model can be very time consuming! Thus we will only train it for the first 32 batches. This is just to show how model training is set up. If you are interested in trying full-scale training, just remove the `steps_per_epoch` and `validation_steps` parameters, and prepare to wait!


In [30]:
model.compile('adam','sparse_categorical_crossentropy',['acc'])
tf.get_logger().setLevel('ERROR')
model.fit(ds_train.map(process_fn).batch(32),validation_data=ds_test.map(process_fn).batch(32),steps_per_epoch=32,validation_steps=2)



<tensorflow.python.keras.callbacks.History at 0x7f1d40a4b6a0>

If you increase the number of iterations, wait long enough, and train for several epochs, you can expect BERT classification to give the best accuracy! That is because BERT already understands the structure of the language quite well, and we only need to fine-tune the final classifier. However, because BERT is a large model, the whole training process takes a long time and requires serious computational power (a GPU, and preferably more than one).

> **Note:** In our example, we have been using one of the smallest pre-trained BERT models. There are larger models that are likely to yield better results.


## Takeaway

In this unit, we have seen very recent model architectures based on **transformers**. We applied them to our text classification task, but BERT models can similarly be used for entity extraction, question answering, and other NLP tasks.

Transformer models represent the current state of the art in NLP, and in most cases they should be the first solution you experiment with when implementing custom NLP solutions. However, understanding the basic principles of recurrent neural networks discussed in this module is extremely important if you want to build advanced neural models.


---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
