<div align="center">
<a href="https://vbti.nl"><img src="https://docs.google.com/uc?export=download&id=1DdCGllL51O5wBuiI0rwygofKx3YIDPHX" width="400"></a>
</div>

# Contextualized Embeddings with a Transformer

In the previous notebooks, we used the Word2Vec model to learn embeddings from text. These embeddings are learned in an unsupervised setting, by predicting a word given its context (or the context given a word). While they are quite useful, they have one major downside: once learned, they are detached from the context. A single word can mean different things when surrounded by different words. For example, the word "transformer" could refer to an electrical device, Optimus Prime, or a neural network architecture. It is impossible to determine the meaning of the word without its context. ELMo, presented in the [original paper](https://arxiv.org/pdf/1802.05365.pdf), builds on this observation and forms **contextualized embeddings** for a word by taking its neighbors into account. It does so by employing a bi-directional LSTM, which looks at the words that come before and after to form the embeddings. As we have already learned, recurrent neural networks are hard to train because of backpropagation through time. This leads to two major downsides of RNNs: 1) they process tokens one after another, so they cannot fully exploit parallel computation, and 2) they have difficulty retaining long-term dependencies.
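To make the difference concrete, here is a toy sketch. The 2-dimensional vectors are hand-made for illustration (they are not learned embeddings), and averaging over neighbors is only a crude stand-in for what ELMo or BERT learn to do. It shows that a lookup table returns the same vector for "transformer" in both sentences, while even a simple context-aware function does not.

```python
# Toy illustration only: hand-made 2-d vectors, not learned embeddings.
STATIC = {
    "the": [0.0, 0.0],
    "transformer": [1.0, 1.0],
    "exploded": [4.0, 0.0],   # "electrical device" flavour
    "learns": [0.0, 4.0],     # "neural network" flavour
}

def static_embedding(tokens, position):
    """Word2Vec-style lookup: the neighbours are ignored."""
    return STATIC[tokens[position]]

def contextual_embedding(tokens, position, window=1):
    """Average the word's vector with the vectors of its neighbours."""
    lo = max(0, position - window)
    hi = min(len(tokens), position + window + 1)
    neighbours = [STATIC[t] for t in tokens[lo:hi]]
    return [sum(dim) / len(neighbours) for dim in zip(*neighbours)]

device_sentence = ["the", "transformer", "exploded"]
model_sentence = ["the", "transformer", "learns"]

same_static = (
    static_embedding(device_sentence, 1) == static_embedding(model_sentence, 1)
)
same_contextual = (
    contextual_embedding(device_sentence, 1) == contextual_embedding(model_sentence, 1)
)
print(same_static, same_contextual)  # True False
```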

With the release of the Transformer paper, [Attention is All You Need](https://arxiv.org/abs/1706.03762), this architecture started to dominate the NLP world. The Transformer architecture is parallelizable, which allows it to make good use of multiple GPUs and TPUs, and it can model long-term relationships with its attention mechanism. Unfortunately, we cannot dive deep into transformers here, since they deserve their own Masterclass (or two).
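For a first intuition, the core of the attention mechanism fits in a few lines of NumPy. The sketch below is deliberately simplified: it handles a single sequence, and it leaves out the learned query/key/value projections and the multiple heads that a real Transformer uses. The sizes are toy values. Note that every token is compared with every other token in one matrix product, which is why the computation parallelizes well and why distant tokens can interact directly.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(q @ k.T / sqrt(d_k)) @ v for a single sequence."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                        # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 tokens, 8-dimensional vectors (toy sizes)

# self-attention: queries, keys and values all come from the same tokens
context, weights = scaled_dot_product_attention(x, x, x)
print(context.shape, weights.shape)
```

Each row of `weights` sums to one and says how much the corresponding token draws from every other token in the sequence.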

If you want to better understand how the transformer works or why it is so good at modeling long-term dependencies, check out this visual description of transformers: https://jalammar.github.io/illustrated-transformer/.

We are going to use a specific transformer architecture called Bidirectional Encoder Representations from Transformers (BERT). The original paper from Google Research can be found here: https://arxiv.org/pdf/1810.04805.pdf. For a more informal and visual guide check out Illustrated BERT: http://jalammar.github.io/illustrated-bert/.

In this notebook, we are going to look at how to use a pre-trained BERT for a custom classification task. We are going to use the [`transformers`](https://pypi.org/project/transformers/) package from [HuggingFace](https://huggingface.co/transformers/index.html). It offers a large number of pre-trained models that are easy to reuse and fine-tune for almost any NLP task. Thanks to Abhishek Thakur and his [book](https://www.amazon.com/Approaching-Almost-Machine-Learning-Problem-ebook/dp/B089P13QHT) for the inspiration.

Learning goals:
 - Learn how to use a pre-trained tokenizer
 - Learn how to prepare the dataset for BERT
 - Learn how to fine-tune BERT for a classification task

## Importing Libraries

In [None]:
!pip install transformers

In [1]:
import pandas as pd
import numpy as np

import torch.nn as nn

from sklearn import model_selection
from sklearn import metrics

from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

from tqdm.notebook import tqdm

## Defining Main Parameters

In [2]:
import transformers
from transformers import AdamW, get_linear_schedule_with_warmup

# this is the maximum number of tokens in the sentence
MAX_LEN = 512
# batch sizes are small because the model is huge!
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
# let's train for a maximum of 10 epochs
EPOCHS = 10
# define path to BERT model files
BERT_PATH = "bert-base-uncased"
# this is where you want to save the model
MODEL_PATH = "model.bin"
# training file
TRAINING_FILE = "https://raw.githubusercontent.com/illyakaynov/masterclass-nlp/master/Case-IMBD_reviews/IMDB.csv"
# define the tokenizer
# we use tokenizer and model
# from huggingface's transformers
TOKENIZER = transformers.BertTokenizer.from_pretrained(
 BERT_PATH,
 do_lower_case=True
)

One of the things that we have defined here is the `BERT_PATH`. This is the name of the model that we are going to use. HuggingFace has implemented dozens of models, which were pre-trained on a big corpus of data. Check out the available models [here](https://huggingface.co/transformers/pretrained_models.html). The `base` part means this is the smaller model with 12 transformer blocks, 12 attention heads, and a hidden size of 768 (the dimension of every token embedding). The `large` version has 24 transformer blocks with 16 attention heads and a hidden size of 1024.
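The two sizes can be compared with some back-of-the-envelope arithmetic. The sketch below is not part of the `transformers` API: `BertSize` and `count_parameters` are hypothetical helpers, the vocabulary size, number of positions, and number of token types are taken from the model printout in this notebook, and the feed-forward inner size is assumed to be four times the hidden size.

```python
from dataclasses import dataclass

@dataclass
class BertSize:
    layers: int
    hidden: int
    heads: int
    vocab: int = 30522         # bert-base-uncased vocabulary size
    max_positions: int = 512
    token_types: int = 2

    @property
    def head_dim(self):
        # each attention head works on an equal slice of the hidden vector
        return self.hidden // self.heads

    @property
    def intermediate(self):
        # assumption: feed-forward inner size is 4x the hidden size
        return 4 * self.hidden

def count_parameters(cfg):
    h, ff = cfg.hidden, cfg.intermediate
    layer_norm = 2 * h                                   # scale and shift
    embeddings = (cfg.vocab + cfg.max_positions + cfg.token_types) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm             # query, key, value, output
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + cfg.layers * (attention + feed_forward) + pooler

base = BertSize(layers=12, hidden=768, heads=12)
large = BertSize(layers=24, hidden=1024, heads=16)
for name, cfg in [("base", base), ("large", large)]:
    print(name, cfg.head_dim, f"{count_parameters(cfg) / 1e6:.0f}M")
```

Under these assumptions both models use 64-dimensional attention heads, and the base model comes out at about 109 million parameters, in line with the roughly 110M usually quoted for BERT-base.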

We also need to define the `TOKENIZER`, which transforms the words in a sentence from strings into integers. We need to use the same tokenizer that was used during BERT's training; otherwise, the mapping from words to integers will be different and the pre-trained embeddings will no longer match their tokens. The tokenizer has a convenient method, `encode_plus()`, which takes our reviews and returns the encoded version.

In [3]:
example_sentence = 'This is an example of a sentence to tokenize'
tokenized_sentence = TOKENIZER.encode_plus(
    example_sentence,
    None,
    add_special_tokens=True,
    truncation=True,
    max_length=15,
)
print(tokenized_sentence['input_ids'])

[101, 2023, 2003, 2019, 2742, 1997, 1037, 6251, 2000, 19204, 4697, 102]


We can map the tokens back to a sentence with `convert_ids_to_tokens()`.

In [4]:
print(TOKENIZER.convert_ids_to_tokens(tokenized_sentence['input_ids']))

['[CLS]', 'this', 'is', 'an', 'example', 'of', 'a', 'sentence', 'to', 'token', '##ize', '[SEP]']
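Notice that "tokenize" was split into `token` and `##ize`. BERT's tokenizer uses WordPiece, which breaks a word that is not in the vocabulary into the longest known pieces, marking continuation pieces with `##`. The following is a minimal sketch of that greedy longest-match-first idea with a tiny, made-up vocabulary; the real tokenizer also handles lowercasing, punctuation, and a vocabulary of about 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]        # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# hypothetical toy vocabulary, for illustration only
TOY_VOCAB = {"this", "is", "an", "example", "token", "##ize", "##s"}

print(wordpiece("tokenize", TOY_VOCAB))  # ['token', '##ize']
print(wordpiece("tokens", TOY_VOCAB))    # ['token', '##s']
print(wordpiece("zebra", TOY_VOCAB))     # ['[UNK]']
```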


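In the call above we only passed `truncation=True` and `max_length=15`, so a longer sequence would be cut to 15 tokens, but a shorter one is not padded, which is why we got 12 ids back. To pad every sequence to the specified length, additionally pass `padding='max_length'` (older versions of the library used `pad_to_max_length=True`); the sequence is then filled up with zeros, the id of the `[PAD]` token. If you leave `max_length=None`, the maximum input length of the model is used, which is 512 for BERT.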

You can also see that special tokens are added:
- [CLS] - stands for classification and marks the start of the input
- [SEP] - indicates the end of the sentence
- [PAD] - padding value, which appears only when padding is enabled
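The interplay of truncation, special tokens, and padding can be sketched in plain Python. `encode_fixed_length` is a hypothetical helper, not a `transformers` function; it assumes the ids 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) that appear in the outputs of this notebook and mimics what the tokenizer does when both truncation and padding to a fixed length are requested.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # ids seen in the outputs of this notebook

def encode_fixed_length(content_ids, max_length):
    """Truncate, add [CLS]/[SEP], then right-pad to exactly max_length."""
    room = max_length - 2                       # reserve space for [CLS] and [SEP]
    ids = [CLS_ID] + list(content_ids[:room]) + [SEP_ID]
    attention_mask = [1] * len(ids)             # 1 = real token
    padding = max_length - len(ids)
    ids += [PAD_ID] * padding
    attention_mask += [0] * padding             # 0 = padding, to be ignored
    return ids, attention_mask

content = [2023, 2003, 2019, 2742]              # "this is an example"

print(encode_fixed_length(content, max_length=8))
# ([101, 2023, 2003, 2019, 2742, 102, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0])
print(encode_fixed_length(content, max_length=5))
# ([101, 2023, 2003, 2019, 102], [1, 1, 1, 1, 1])
```

Whatever the length of the input, the result always has exactly `max_length` entries, which is what allows several reviews to be stacked into one batch.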

The tokenizer returns more values in a dictionary. Let's check all the keys:

In [5]:
tokenized_sentence.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

`token_type_ids` comes from BERT's two-sentence tasks: tokens of the first sentence get zeros and tokens of the second sentence get ones. Since we pass a single sentence here, all values are zero. `attention_mask` indicates which tokens the model should take into account when calculating attention, i.e. it lets the model ignore padded tokens.
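For a sentence pair, the layout is `[CLS] A [SEP] B [SEP]`. The sketch below builds that layout by hand to show where the ones in `token_type_ids` would appear; `build_pair_inputs` is a hypothetical helper that reuses the ids from the tokenizer output above, not a library function.

```python
CLS_ID, SEP_ID = 101, 102   # ids taken from the tokenizer output above

def build_pair_inputs(first_ids, second_ids):
    """Lay out [CLS] A [SEP] B [SEP] and mark the segment of each token."""
    ids = [CLS_ID] + first_ids + [SEP_ID] + second_ids + [SEP_ID]
    # segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP]
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    return ids, token_type_ids

first = [2023, 2003]     # "this is"
second = [2019, 2742]    # "an example"
ids, token_type_ids = build_pair_inputs(first, second)
print(ids)             # [101, 2023, 2003, 102, 2019, 2742, 102]
print(token_type_ids)  # [0, 0, 0, 0, 1, 1, 1]
```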

In [6]:
print(example_sentence.split())
print(tokenized_sentence['input_ids'])
print(tokenized_sentence['token_type_ids'])
print(tokenized_sentence['attention_mask'])
print(TOKENIZER.convert_ids_to_tokens(tokenized_sentence['input_ids']))

['This', 'is', 'an', 'example', 'of', 'a', 'sentence', 'to', 'tokenize']
[101, 2023, 2003, 2019, 2742, 1997, 1037, 6251, 2000, 19204, 4697, 102]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
['[CLS]', 'this', 'is', 'an', 'example', 'of', 'a', 'sentence', 'to', 'token', '##ize', '[SEP]']


## Dataset Preparation
We are going to use the IMDB movie review dataset. The `BERTDataset` class will be responsible for preprocessing the reviews so that they can be used as input to our model.

In [7]:
import torch
class BERTDataset:
    def __init__(self, review, target):
        """
        :param review: list or numpy array of strings
        :param target: list or numpy array which is binary
        """
        self.review = review
        self.target = target
        
        self.tokenizer = TOKENIZER
        self.max_len = MAX_LEN
        
    def __len__(self):
        # this returns the length of dataset
        return len(self.review)
    
    def __getitem__(self, item):
        # for a given item index, return a dictionary
        # of inputs
        review = str(self.review[item])
        review = " ".join(review.split())
        # encode_plus comes from huggingface's transformers
        # and exists for all tokenizers they offer
        # it can be used to convert a given string
        # to ids, mask and token type ids which are
        # needed for models like BERT
        # here, review is a string
        inputs = self.tokenizer.encode_plus(
            review,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True
        )
        # ids are ids of tokens generated
        # after tokenizing reviews
        ids = inputs["input_ids"]
        # mask is 1 where we have input
        # and 0 where we have padding
        mask = inputs["attention_mask"]
        # token type ids are all zeros in this
        # specific case because we pass one sentence
        # in case of two sentences, this is 0
        # for first sentence and 1 for second sentence
        token_type_ids = inputs["token_type_ids"]
        # now we return everything
        # note that ids, mask and token_type_ids
        # are all long datatypes and targets is float
        return {
            "ids": torch.tensor(
                ids, dtype=torch.long
            ),
            "mask": torch.tensor(
                mask, dtype=torch.long
            ),
            "token_type_ids": torch.tensor(
                token_type_ids, dtype=torch.long
            ),
            "targets": torch.tensor(
                self.target[item], dtype=torch.float
            )
        }



Let's have a look at an example dataset. We are going to use the familiar `torch.utils.data.DataLoader` to form batches for training and validation. But first, let's encode the labels as integer classes.

In [8]:
df = pd.read_csv(TRAINING_FILE)
# encode labels
df.sentiment = df.sentiment.apply(
    lambda x: 1 if x == "positive" else 0
)

In [9]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
...,...,...
49995,I thought this movie did a down right good job...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,0
49998,I'm going to have to disagree with the previou...,0


In [10]:
example_dataset = BERTDataset(df.review.values, df.sentiment.values)
example_data_loader = torch.utils.data.DataLoader(
    example_dataset,
    batch_size=4,
    num_workers=0
)

sample = next(iter(example_data_loader))

In [11]:
sample

{'ids': tensor([[  101,  2028,  1997,  ...,     0,     0,     0],
         [  101,  1037,  6919,  ...,     0,     0,     0],
         [  101,  1045,  2245,  ...,     0,     0,     0],
         [  101, 10468,  2045,  ...,     0,     0,     0]]),
 'mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'targets': tensor([1., 1., 1., 0.])}
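Each key of the batch is simply the per-review values stacked along a new first axis. As a rough NumPy stand-in for what the `DataLoader` does with dictionaries by default, consider the sketch below; `collate` is a hypothetical helper and the token ids are toy values.

```python
import numpy as np

def collate(samples):
    """Stack a list of per-review dicts into one dict of batched arrays."""
    return {key: np.stack([s[key] for s in samples]) for key in samples[0]}

# two fake, already padded "reviews" of length 6 (toy values)
samples = [
    {"ids": np.array([101, 2028, 1997, 102, 0, 0]),
     "mask": np.array([1, 1, 1, 1, 0, 0]),
     "targets": np.array(1.0)},
    {"ids": np.array([101, 1037, 6919, 2210, 102, 0]),
     "mask": np.array([1, 1, 1, 1, 1, 0]),
     "targets": np.array(0.0)},
]

batch = collate(samples)
print({key: value.shape for key, value in batch.items()})
```

Stacking only works when all reviews have the same length, which is the reason the dataset pads every review to `MAX_LEN`.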

## Pre-trained Transformer

In [12]:
model = transformers.BertModel.from_pretrained(BERT_PATH)

In [13]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [14]:
last_hidden_state, pool = model(
    sample['ids'],
    attention_mask=sample['mask'],
    token_type_ids=sample['token_type_ids'],
    return_dict=False
)

Let's have a look at the shapes. The last hidden state has the shape (batch_size, seq_length, n_hidden), while the pooled output has the shape (batch_size, n_hidden).

In [15]:
last_hidden_state.shape

torch.Size([4, 512, 768])

In [16]:
pool.shape

torch.Size([4, 768])

With `return_dict=False`, BERT returns a tuple of two outputs: the last hidden state and the output of the pooler layer. The last hidden state contains the **contextual embeddings** of every token in the sequence. The pooled output is produced by taking the contextual embedding of the first token (`[CLS]`) and passing it through a dense layer with a tanh activation. Because `[CLS]` attends to all other tokens, the pooled output can be used as an embedding of the whole document, or in this case the review. The only thing left to do is to train an additional dense layer that separates these documents into the two categories. You can check out a nice visualization of all embeddings formed by BERT in this [article](https://towardsdatascience.com/visualize-bert-sequence-embeddings-an-unseen-way-1d6a351e4568).
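The relation between the two shapes can be reproduced with a small NumPy sketch of the pooling step as the BERT pooler in `transformers` performs it: it selects the hidden state at the first (`[CLS]`) position and applies a dense layer followed by tanh. The sizes are toy values and the weights are random stand-ins for the learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_length, n_hidden = 4, 16, 8      # toy sizes, not BERT's 512 and 768

last_hidden_state = rng.normal(size=(batch_size, seq_length, n_hidden))
W = rng.normal(size=(n_hidden, n_hidden))        # random stand-in for learned weights
b = np.zeros(n_hidden)

def pooler(hidden_states, weight, bias):
    """Dense layer plus tanh applied to the first ([CLS]) position only."""
    first_token = hidden_states[:, 0]            # (batch_size, n_hidden)
    return np.tanh(first_token @ weight + bias)

pooled = pooler(last_hidden_state, W, b)
print(last_hidden_state.shape, pooled.shape)     # (4, 16, 8) (4, 8)
```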

In [17]:
# remove the model from the memory
del model
torch.cuda.empty_cache()

## Building a Model for Classification
Now we will take the pre-trained BERT model and encapsulate it in a class. As discussed, we will add a single `Linear` layer on top of the pooled output.

In [18]:
import transformers
import torch.nn as nn
class BERTBaseUncased(nn.Module):
    def __init__(self):
        super(BERTBaseUncased, self).__init__()
        self.bert = transformers.BertModel.from_pretrained(BERT_PATH)
        # add a dropout for regularization
        self.bert_drop = nn.Dropout(0.3)
        # a simple linear layer for output
        # yes, there is only one output
        self.out = nn.Linear(768, 1)
    def forward(self, ids, mask, token_type_ids, return_dict=False):
        # BERT in its default settings returns two outputs
        # last hidden state and output of bert pooler layer
        # we use the output of the pooler which is of the size
        # (batch_size, hidden_size)
        # hidden size can be 768 or 1024 depending on
        # if we are using bert base or large respectively
        # in our case, it is 768
        # note that this model is pretty simple
        # you might want to use last hidden state
        # or several hidden states
        _, o2 = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids,
            return_dict=return_dict
        )
        # pass through dropout layer
        bo = self.bert_drop(o2)
        # pass through linear layer
        output = self.out(bo)
        # return output
        return output
    


We are also going to define a loss function and the training method. For the loss, we are going to use binary cross entropy with logits, since we are training the model for binary classification. The training method will look very similar to what you have used already. Check out the comments in the code for the precise steps.
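As a side note on the loss: "with logits" means that the model outputs a raw real number and the sigmoid is folded into the loss, which is numerically safer than applying the sigmoid first. The sketch below compares the naive formula with a commonly used stable rearrangement for a single example in plain Python; it illustrates the idea and is not PyTorch's actual implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """-[t*log(p) + (1-t)*log(1-p)] with p = sigmoid(logit)."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def bce_with_logits(logit, target):
    """Algebraically the same loss, rearranged to avoid log(0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit, target in [(0.0, 1.0), (2.5, 1.0), (2.5, 0.0), (-3.0, 0.0)]:
    print(logit, target,
          round(bce_with_logits(logit, target), 4),
          round(bce_naive(logit, target), 4))

# a very confident wrong prediction: the naive version would hit log(0) here
print(bce_with_logits(1000.0, 0.0))
```

A logit of 0 corresponds to a probability of 0.5, so the loss in the first row equals log(2) regardless of the target.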

In [19]:
import torch
import torch.nn as nn
def loss_fn(outputs, targets):
    """
    This function returns the loss.
    :param outputs: output from the model (real numbers)
    :param targets: input targets (binary)
    """
    return nn.BCEWithLogitsLoss()(outputs, targets.view(-1, 1))

def train_fn(data_loader, model, optimizer, device, scheduler):
    """
    This is the training function which trains for one epoch
    :param data_loader: it is the torch dataloader object
    :param model: torch model, bert in our case
    :param optimizer: adam, sgd, etc
    :param device: can be cpu or cuda
    :param scheduler: learning rate scheduler
    """
    # put the model in training mode
    model.train()
    # loop over all batches
    for d in tqdm(data_loader):
        # extract ids, token type ids and mask
        # from current batch
        # also extract targets
        ids = d["ids"]
        token_type_ids = d["token_type_ids"]
        mask = d["mask"]
        targets = d["targets"]
        # move everything to specified device
        ids = ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        mask = mask.to(device, dtype=torch.long)
        targets = targets.to(device, dtype=torch.float)
        # zero-grad the optimizer
        optimizer.zero_grad()
        # pass through the model
        outputs = model(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids,
            return_dict=False

        )
        # calculate loss
        loss = loss_fn(outputs, targets)
        # backward step the loss
        loss.backward()
        # step optimizer
        optimizer.step()
        # step scheduler
        scheduler.step()

In [27]:
def eval_fn(data_loader, model, device):
    """
    this is the validation function that generates
    predictions on validation data
    :param data_loader: it is the torch dataloader object
    :param model: torch model, bert in our case
    :param device: can be cpu or cuda
    :return: output and targets
    """
    # put model in eval mode
    model.eval()
    # initialize empty lists for
    # targets and outputs
    fin_targets = []
    fin_outputs = []
    # use the no_grad scope
    # it's very important, else you might
    # run out of gpu memory
    with torch.no_grad():
        # this part is same as training function
        # except for the fact that there is no
        # zero_grad of optimizer and there is no loss
        # calculation or scheduler steps.
        for d in tqdm(data_loader):
            ids = d["ids"]
            token_type_ids = d["token_type_ids"]
            mask = d["mask"]
            targets = d["targets"]
            ids = ids.to(device, dtype=torch.long)
            token_type_ids = token_type_ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            targets = targets.to(device, dtype=torch.float)
            outputs = model(
                ids=ids,
                mask=mask,
                token_type_ids=token_type_ids,
                return_dict=False
            )
            # convert targets to cpu and extend the final list
            targets = targets.cpu().detach()
            fin_targets.extend(targets.numpy().tolist())
            # convert outputs to cpu and extend the final list
            outputs = torch.sigmoid(outputs).cpu().detach()
            fin_outputs.extend(outputs.numpy().tolist())
    return fin_outputs, fin_targets

In [21]:
def train(model, train_data_loader, valid_data_loader, n_epochs=1, lr=3e-5, device='cuda'):
    # this function trains the model
    device = torch.device(device)
    model.to(device)
    # create parameters we want to optimize
    # we generally don't use any weight decay
    # for biases and LayerNorm weights
    param_optimizer = list(model.named_parameters())
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    optimizer_parameters = [
        {
            "params": [
                p for n, p in param_optimizer if
                not any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.001,
        },
        {
            "params": [
                p for n, p in param_optimizer if
                any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]
    # calculate the number of training steps
    # (batches per epoch times number of epochs)
    # this is used by scheduler
    num_train_steps = len(train_data_loader) * n_epochs
    # AdamW optimizer
    # AdamW is the most widely used optimizer
    # for transformer based networks
    optimizer = AdamW(optimizer_parameters, lr=lr)
    # fetch a scheduler
    # you can also try using reduce lr on plateau
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=num_train_steps
    )
    # if you have multiple GPUs
    # wrap the model in DataParallel
    # to use multiple GPUs
    model = nn.DataParallel(model)
    # start training the epochs
    best_accuracy = 0
    for epoch in range(n_epochs):
        train_fn(
            train_data_loader, model, optimizer, device, scheduler
        )
        outputs, targets = eval_fn(
            valid_data_loader, model, device
        )
        outputs = np.array(outputs) >= 0.5
        accuracy = metrics.accuracy_score(targets, outputs)
        print(f"Accuracy Score = {accuracy}")
        if accuracy > best_accuracy:
            torch.save(model.state_dict(), MODEL_PATH)
            best_accuracy = accuracy
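The scheduler used above changes the learning rate after every optimizer step. A plain-Python sketch of the multiplier that a linear schedule with warmup applies to the base learning rate is shown below; `linear_schedule_factor` is a hypothetical name, and the step count assumes 50,000 reviews with 90% used for training.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 up to 1 during warmup
        return step / max(1, num_warmup_steps)
    # linear decay from 1 down to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 3e-5
steps_per_epoch = 45000 // 8          # assumed 45,000 training reviews, batch size 8
total_steps = steps_per_epoch * 1     # one epoch

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With `num_warmup_steps=0` the learning rate starts at its full value and decays linearly to zero at the last step, so the total number of steps handed to the scheduler should match the number of optimizer steps actually taken.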

In [22]:
# read the training file and fill NaN values with "none"
# you can also choose to drop NaN values
dfx = pd.read_csv(TRAINING_FILE).fillna("none")
# sentiment = 1 if its positive
# else sentiment = 0
dfx.sentiment = dfx.sentiment.apply(
    lambda x: 1 if x == "positive" else 0
)
# we split the data into single training
# and validation fold
df_train, df_valid = model_selection.train_test_split(
    dfx,
    test_size=0.1,
    random_state=42,
    stratify=dfx.sentiment.values
)
# reset index
df_train = df_train.reset_index(drop=True)
df_valid = df_valid.reset_index(drop=True)

# for training dataset
train_dataset = BERTDataset(
    review=df_train.review.values,
    target=df_train.sentiment.values
)
# create training dataloader
train_data_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=TRAIN_BATCH_SIZE,
    num_workers=0
)

# for validation dataset
valid_dataset = BERTDataset(
    review=df_valid.review.values,
    target=df_valid.sentiment.values
)
# create validation data loader
valid_data_loader = torch.utils.data.DataLoader(
    valid_dataset,
    batch_size=VALID_BATCH_SIZE,
    num_workers=0
)

In [None]:
bert_model = BERTBaseUncased()
train(bert_model, train_data_loader, valid_data_loader, n_epochs=1)

The training will take approximately 40 min per epoch. Training even for one epoch will give a nice result.

We have also included fine-tuned weights that you can use. The next cell will download them for you.

In [1]:
def download_file(url, path):
    """
    Download file and save it to the defined location
    
    https://stackoverflow.com/questions/37573483/progress-bar-while-download-file-over-http-with-requests/37573701
    """
    import requests
    from tqdm.notebook import tqdm
    import os
    
    
    if os.path.exists(path):
        print('File "{}" already exists. Skipping download.'.format(path))
        return
    
    response = requests.get(url, stream=True)
    total_size_in_bytes= int(response.headers.get('content-length', 0))
    block_size = 1024 #1 Kibibyte
    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
    with open(path, 'wb') as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data)
    progress_bar.close()
    if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
        print("ERROR, something went wrong")
        
download_file(
    'https://github.com/illyakaynov/masterclass_datasets/raw/master/RNN%20and%20Transformers/BERT_IMDB.bin',
    'model.bin'
)

File "model.bin" already exists. Skipping download.


In [23]:
def load_model(path):
    model = BERTBaseUncased()
    model = nn.DataParallel(model)
    model.load_state_dict(torch.load(path))
    return model

bert_model = load_model('model.bin')
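`load_model` wraps the network in `nn.DataParallel` before loading because the weights were saved from a wrapped model, and such a wrapper prefixes every parameter name with `module.`. If you would rather load the weights into a plain `BERTBaseUncased`, one option is to strip that prefix from the keys first. The helper below is a hypothetical sketch that works on any dictionary; the strings stand in for the tensors that `torch.load` would return.

```python
from collections import OrderedDict

def strip_data_parallel_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict whose keys no longer start with prefix."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )

# stand-in for torch.load("model.bin"); the values would normally be tensors
saved = OrderedDict([
    ("module.bert.embeddings.word_embeddings.weight", "tensor-1"),
    ("module.out.weight", "tensor-2"),
    ("module.out.bias", "tensor-3"),
])

print(list(strip_data_parallel_prefix(saved)))
# ['bert.embeddings.word_embeddings.weight', 'out.weight', 'out.bias']
```

The cleaned dictionary could then be passed to `load_state_dict` of an unwrapped model.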

In [28]:
# DataParallel expects the wrapped model to already live on the GPU
bert_model.to('cuda')
outputs, targets = eval_fn(
    valid_data_loader, bert_model, 'cuda'
)
outputs = np.array(outputs) >= 0.5
accuracy = metrics.accuracy_score(targets, outputs)
print(f"Accuracy Score = {accuracy}")

Accuracy Score = 0.947
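The accuracy above depends on the 0.5 threshold that turns probabilities into class predictions. The sketch below spells out that step with NumPy on toy values shaped like the lists returned by `eval_fn`; `accuracy_at_threshold` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def accuracy_at_threshold(probabilities, targets, threshold=0.5):
    """Share of reviews whose thresholded probability matches the label."""
    predictions = np.asarray(probabilities).reshape(-1) >= threshold
    labels = np.asarray(targets).reshape(-1) == 1
    return float(np.mean(predictions == labels))

# toy values shaped like eval_fn's return: one-element lists of probabilities
outputs = [[0.91], [0.08], [0.55], [0.49]]
targets = [1.0, 0.0, 0.0, 1.0]

print(accuracy_at_threshold(outputs, targets))                 # 0.5
print(accuracy_at_threshold(outputs, targets, threshold=0.6))  # 0.75
```

Moving the threshold trades false positives against false negatives, so 0.5 is a sensible default for a balanced dataset rather than a fixed rule.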


## Next steps
- Use a different model
- Adjust the implementation for a multi-class classification problem
- Use the model for a different task, for example question answering or machine translation
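As a starting point for the multi-class step: the usual change is to let the final layer produce one logit per class (for example `nn.Linear(768, 3)`), to store the targets as integer class indices, and to replace the binary loss with a softmax cross entropy such as `nn.CrossEntropyLoss`. The plain-Python sketch below shows what that loss computes for one review; the three classes and the logit values are made up for illustration.

```python
import math

def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]   # shifted for stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_class):
    """Negative log-probability of the correct class."""
    return -math.log(softmax(logits)[target_class])

# hypothetical 3-class problem: 0 = negative, 1 = neutral, 2 = positive
logits = [0.2, -1.0, 2.3]     # what a 3-output head might produce for one review
probabilities = softmax(logits)
predicted_class = probabilities.index(max(probabilities))

print(predicted_class)
print(round(cross_entropy(logits, target_class=2), 4))
```

The prediction is now the class with the highest probability, so the 0.5 threshold used for the binary case is no longer needed.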