# BERT: Bidirectional Encoder Representations from Transformers

This notebook explores the hugely influential [BERT sentence encoder](https://arxiv.org/pdf/1810.04805) proposed by Google's team in 2018.

## 1. What is BERT?

BERT is an encoder-only transformer architecture that takes in a chunk of text, breaks it down into tokens and outputs one contextual embedding for each token. These embeddings are expected to capture information about the semantics and grammatical structure of human-written text, and can then be used with transfer learning to perform different kinds of tasks (e.g. sentiment analysis).

The usefulness of BERT comes from the fact that this large model has already been pre-trained on large corpora, which is usually an expensive and time-consuming endeavor. So we can take these pre-trained weights, make some downstream modifications, train the model on a more specific dataset and cheaply repurpose it for another task. This process is called 'fine-tuning'.

## 2. How is BERT trained?

BERT is trained in a self-supervised fashion to take two sentences as inputs ($A$, $B$) and perform two simultaneous tasks:
1. **Next Sentence Prediction (NSP)**: predicting whether $B$ follows $A$ (i.e. if the sentences are next to each other in the corpus)
2. **Masked Language Model (MLM)**: masking a few of the tokens from these sentences and using the model to predict which tokens were masked (e.g. 'I [MASK] your father' $\rightarrow$ model([MASK]) = 'am')
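
To make the MLM objective concrete, here is a minimal sketch of the corruption step in plain Python. It follows the recipe described in the BERT paper: about 15% of the tokens are selected for prediction, and of those 80% are replaced by [MASK], 10% by a random token and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary and the seed are assumptions made for illustration; they are not part of any library.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for a toy MLM example.

    labels[i] holds the original token at every selected position and
    None elsewhere, so a loss would only be computed on selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        # Special tokens are never selected for prediction
        if tok in SPECIAL_TOKENS or rng.random() >= select_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the original token
    return corrupted, labels

tokens = ["[CLS]", "i", "am", "your", "father", "[SEP]"]
vocab = ["i", "am", "your", "father", "water", "drank"]  # toy vocabulary

# A high selection probability is used here only so that the short
# example is likely to show some corrupted positions.
corrupted, labels = mask_tokens(tokens, vocab, select_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The model sees `corrupted` as input and is trained to recover the non-`None` entries of `labels`.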

The inputs take the following format:

$$
[\text{CLS}] \;\; t_1^A \;\; t_2^A \;\; t_3^A \;\; \dots \;\; t_{|A|}^A 
\;\; [\text{SEP}] \;\; t_1^B \;\; t_2^B \;\; t_3^B \;\; \dots \;\; t_{|B|}^B \;\; [\text{SEP}]
$$

Here, $t_i^X$ represents the $i^{th}$ token of sentence $X$. [SEP] is the special token that separates the two sentences and also closes the sequence. [CLS] is another special token placed at the beginning of the sequence, whose final embedding is meant to encapsulate a general encoding of the entire input (i.e. not only at the token level).
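
This layout can be sketched in a few lines of plain Python. `build_pair_input` is a hypothetical helper (not a Hugging Face function) that assumes both sentences are already tokenized. It includes the closing [SEP] that BERT's tokenizer appends after the second sentence, and it also returns the segment ids (0 for sentence A, 1 for sentence B).

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    ["i", "was", "thirsty", "!"],
    ["i", "drank", "water", ":", "d"],
)
print(" ".join(tokens))
print(segment_ids)
```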

<center><img src="data/BERT_architecture.png" width=300 height=300></center>


## 3. What data was BERT trained on?

The original BERT model was trained only on English sentence pairs taken from two different corpora: the BookCorpus (800M words) and English Wikipedia (2,500M words). There have been many updates to BERT since its release, some focusing on language-specific versions (e.g. CamemBERT for French) and others on multilingual versions (e.g. mBERT), but in this notebook, I'll use the original English-only model.

## 4. Using BERT

The simplest way to access BERT's architecture, tokenizer and pre-trained weights is through Hugging Face's `transformers` library.

### 4.1 - BERT Tokenization

First, we load BERT's tokenizer, which will be used to convert a sentence into a sequence of token indexes.

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

We can now check which tokens some of the token indexes represent.

In [2]:
for idx in [0, 101, 102, 103, 10030, 3029, 5142]:
    token = tokenizer.convert_ids_to_tokens(idx)
    print(f'idx: {idx:7}, token: ', token)

idx:       0, token:  [PAD]
idx:     101, token:  [CLS]
idx:     102, token:  [SEP]
idx:     103, token:  [MASK]
idx:   10030, token:  vacant
idx:    3029, token:  organization
idx:    5142, token:  concern


When applying the tokenizer to a sentence, we get a dictionary with:
- **input_ids**: the token indexes (including [CLS] and [SEP])
- **token_type_ids**: the sentence indexes (0 for first, 1 for second)
- **attention_mask**: 1 if the token should be used in the attention calculations

In [3]:
inputs = tokenizer("I was thirsty!", "I drank water :D", return_tensors="pt")
for key, value in inputs.items():
    print(f'{key}: \n{value}\n')

input_ids: 
tensor([[  101,  1045,  2001, 24907,   999,   102,  1045, 10749,  2300,  1024,
          1040,   102]])

token_type_ids: 
tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]])

attention_mask: 
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
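
In the example above every entry of attention_mask is 1, because a single input needs no padding. The mask matters once sequences of different lengths are batched together. The sketch below mimics that padding step in plain Python. `pad_batch` is a hypothetical helper written for illustration (the tokenizer handles this itself when called with `padding=True`), and it uses id 0 for [PAD], as in the vocabulary lookup shown earlier.

```python
PAD_ID = 0  # id of [PAD] in bert-base-uncased, as printed earlier

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding to be ignored by attention
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [101, 1045, 2001, 24907, 999, 102],  # [CLS] i was thirsty ! [SEP]
    [101, 2300, 102],                    # [CLS] water [SEP]
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```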



___

We can now see what tokens are being fed to BERT. Note that, since we are using the 'uncased' version of the model, the sentences are all converted to lower case before tokenization.

In [4]:
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
print('tokens:', ' '.join(tokens))

tokens: [CLS] i was thirsty ! [SEP] i drank water : d [SEP]


### 4.2 - BERT Embeddings

Next, we import BERT from the Hugging Face repository. We will work with the lighter version (base) that disregards lower/upper case (uncased).

In [5]:
from transformers import BertModel

model = BertModel.from_pretrained("google-bert/bert-base-uncased")

BERT now uses this _inputs_ dictionary to build three separate embeddings:
- **position embeddings**: takes token position in the sequence into account (0,1,2,3,...)
- **token embeddings**: one different embedding for each token
- **sentence embeddings**: one different embedding for each sentence (0 and 1)

In [6]:
# Position embeddings
pos_embed = model.embeddings.position_embeddings

# Embedding dimension = max sequence length x dimension
print(f'dim. of position embeddings: {list(pos_embed.weight.shape)}')

# Embeddings print
pos_embed.weight

dim. of position embeddings: [512, 768]


Parameter containing:
tensor([[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
          6.8312e-04,  1.5441e-02],
        [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
          2.9753e-02, -5.3247e-03],
        [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
          1.8741e-02, -7.3140e-03],
        ...,
        [ 1.7418e-02,  3.4903e-03, -9.5621e-03,  ...,  2.9599e-03,
          4.3435e-04, -2.6949e-02],
        [ 2.1687e-02, -6.0216e-03,  1.4736e-02,  ..., -5.6118e-03,
         -1.2590e-02, -2.8085e-02],
        [ 2.6413e-03, -2.3298e-02,  5.4922e-03,  ...,  1.7537e-02,
          2.7550e-02, -7.7656e-02]], requires_grad=True)

In [7]:
# Token embeddings
token_embed = model.embeddings.word_embeddings

# Embedding dimension = size of vocabulary x dimension
print(f'dim. of token embeddings: {list(token_embed.weight.shape)}')

# Embeddings print
token_embed.weight

dim. of token embeddings: [30522, 768]


Parameter containing:
tensor([[-0.0102, -0.0615, -0.0265,  ..., -0.0199, -0.0372, -0.0098],
        [-0.0117, -0.0600, -0.0323,  ..., -0.0168, -0.0401, -0.0107],
        [-0.0198, -0.0627, -0.0326,  ..., -0.0165, -0.0420, -0.0032],
        ...,
        [-0.0218, -0.0556, -0.0135,  ..., -0.0043, -0.0151, -0.0249],
        [-0.0462, -0.0565, -0.0019,  ...,  0.0157, -0.0139, -0.0095],
        [ 0.0015, -0.0821, -0.0160,  ..., -0.0081, -0.0475,  0.0753]],
       requires_grad=True)

In [8]:
# Sentence embeddings
sent_embed = model.embeddings.token_type_embeddings

# Embedding dimension = 2 x dimension
print(f'dim. of sentence embeddings: {list(sent_embed.weight.shape)}')

# Embeddings print
sent_embed.weight

dim. of sentence embeddings: [2, 768]


Parameter containing:
tensor([[ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0011, -0.0030, -0.0032,  ...,  0.0047, -0.0052, -0.0112]],
       requires_grad=True)

The input to the BERT network is, for each token, the sum of these three embeddings.
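
As a toy illustration of that sum, the sketch below uses tiny hand-written lookup tables (3 dimensions instead of 768) in plain Python. All table values are made up. In the Hugging Face implementation, the summed vector additionally passes through the LayerNorm and dropout modules that belong to `BertEmbeddings`; that step is omitted here.

```python
# Toy lookup tables: every row is a made-up 3-dimensional embedding
token_table = {
    101: [0.1, 0.0, 0.2],    # [CLS]
    1045: [0.5, -0.3, 0.0],  # i
    102: [0.0, 0.4, -0.1],   # [SEP]
}
position_table = [[0.01, 0.02, 0.03], [0.04, 0.05, 0.06], [0.07, 0.08, 0.09]]
segment_table = [[0.0, 0.1, 0.0], [0.2, 0.0, 0.0]]

def embed(input_ids, token_type_ids):
    """Sum token, position and segment embeddings for each input position."""
    summed = []
    for pos, (tok_id, seg_id) in enumerate(zip(input_ids, token_type_ids)):
        rows = (token_table[tok_id], position_table[pos], segment_table[seg_id])
        summed.append([sum(values) for values in zip(*rows)])
    return summed

for vector in embed([101, 1045, 102], [0, 0, 0]):
    print([round(v, 2) for v in vector])
```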

### 4.3 - BERT Outputs


Let's first take a look at BERT's architecture.
It consists of:
1. **Embedding**: the three embeddings discussed in 4.2
2. **BERT Layers**: 12 (base version) BERT layers in series, each of which contains a multi-headed attention layer, skip connections, layer normalization and a feedforward layer
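
To make the "skip connection plus layer normalization" pattern concrete, here is a minimal plain-Python sketch for a single token vector. It assumes the post-layer-norm arrangement used by BERT, LayerNorm(x + Sublayer(x)), and omits both the learned scale and bias of LayerNorm and the dropout. The function names and the stand-in sublayer are illustrative only.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Apply a sublayer, add the skip connection, then normalize."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def double(x):
    """Stand-in for attention or the feedforward layer (keeps the dimension)."""
    return [2.0 * v for v in x]

out = residual_block([1.0, 2.0, 3.0, 4.0], double)
print([round(v, 3) for v in out])
```

Each BERT layer applies this pattern twice: once around the attention sublayer and once around the feedforward sublayer.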

In [9]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

___

In this implementation, the BERT model receives the _inputs_ dictionary and outputs two sets of vectors:
- **last_hidden_state**: the final contextual embeddings for each input token (including [CLS] and [SEP])
- **pooler_output**: the contextual embedding for [CLS] passed through a dense layer followed by a tanh activation


In [10]:
outputs = model(**inputs)

Let's now print the first 10 dimensions of the output embedding of each token.

In [11]:
for i, idx in enumerate(inputs['input_ids'][0]):
    context_embed = outputs.last_hidden_state[0,i,0:10].tolist()
    context_embed = [round(emb,2) for emb in context_embed]
    print(f'{tokenizer.convert_ids_to_tokens([idx])[0]:10} -> {context_embed}\n')
    

[CLS]      -> [-0.23, 0.37, -0.34, -0.35, -0.91, 0.01, 0.67, 0.64, 0.1, -0.32]

i          -> [0.11, 0.19, 0.05, 0.25, -0.13, 0.56, -0.07, 1.19, -0.03, -0.78]

was        -> [0.45, 0.47, 0.29, 0.36, 0.07, 0.35, 0.37, 0.82, -0.54, -0.47]

thirsty    -> [0.88, 0.08, -0.0, -0.03, 1.2, 0.08, 0.37, 0.98, -0.37, 0.08]

!          -> [0.22, 0.1, -0.11, -0.29, 0.19, 0.44, 0.61, 0.16, -0.67, -0.38]

[SEP]      -> [0.6, 0.19, -0.24, 0.49, -0.5, -0.79, 0.73, -0.06, 0.46, 0.15]

i          -> [0.05, 0.29, 0.28, 0.03, -0.71, 0.5, 0.35, 1.22, 0.34, -0.32]

drank      -> [0.16, 0.34, 0.4, -0.17, -0.33, 0.24, 0.6, 0.89, 0.01, -0.3]

water      -> [0.22, 0.05, 0.1, -0.23, 0.09, -0.27, 0.31, 0.82, -0.35, 0.01]

:          -> [0.11, 0.39, -0.11, -0.63, -0.53, 0.44, 0.71, 0.21, 0.08, 0.23]

d          -> [0.15, 0.93, 0.78, -0.26, -0.7, 0.4, 1.27, -0.72, 0.36, -0.42]

[SEP]      -> [0.75, 0.15, -0.57, 0.65, -0.7, -0.76, 0.75, -0.26, 0.7, 0.01]



Let's check the pooler_output now. It is the contextual embedding of [CLS] from above, passed through a dense layer and a tanh activation, which is why all of its values lie between -1 and 1.

In [12]:
outputs.pooler_output[0][:10]

tensor([-0.9803, -0.7672, -0.9958,  0.9636,  0.9004, -0.5488,  0.9910,  0.6867,
        -0.9878, -1.0000], grad_fn=<SliceBackward0>)
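
A minimal sketch of what the pooler computes, in plain Python with toy numbers: a dense layer applied to the [CLS] vector, followed by tanh. The 2x3 weight matrix and the bias below are made up for illustration; in the Hugging Face model the corresponding modules are `model.pooler.dense` and `model.pooler.activation`, with a 768x768 weight matrix.

```python
import math

def pooler(cls_vector, weight, bias):
    """Dense layer followed by tanh, applied to the [CLS] hidden state."""
    pooled = []
    for row, b in zip(weight, bias):
        pre_activation = sum(w * h for w, h in zip(row, cls_vector)) + b
        pooled.append(math.tanh(pre_activation))
    return pooled

cls_vector = [-0.23, 0.37, -0.34]              # first 3 dims of [CLS] printed earlier
weight = [[1.0, -2.0, 0.5], [3.0, 0.0, -4.0]]  # made-up weights
bias = [0.1, -0.2]                             # made-up bias

pooled = pooler(cls_vector, weight, bias)
print([round(p, 4) for p in pooled])
```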

___

There are 4 kinds of downstream tasks that we can use BERT for:
1. **1 sentence, a label for each token**: e.g. named entity recognition
    - use last_hidden_state
2. **1 sentence, a label for the entire sentence**: e.g. sentiment analysis
    - use pooler_output
3. **2 sentences, a label for each token**: e.g. extractive question answering
    - use last_hidden_state
4. **2 sentences, a label for the pair**: e.g. find duplicate questions
    - use pooler_output
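
The difference between token-level and input-level tasks comes down to which output the classification head is applied to. The plain-Python sketch below uses toy stand-ins for BERT's outputs (hidden size 2 instead of 768) and a made-up head with 3 labels; the helper `linear` is illustrative, not a library function.

```python
def linear(vector, weight, bias):
    """Apply a dense layer (one row of `weight` per output label)."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight, bias)]

# Toy stand-ins for BERT's outputs with hidden size 2 instead of 768
last_hidden_state = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]  # one vector per token
pooler_output = [0.7, -0.8]                                 # one vector per input

weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up head with 3 labels
bias = [0.0, 0.0, 0.0]

# Token-level task: one row of logits per token
token_logits = [linear(h, weight, bias) for h in last_hidden_state]
# Input-level task: a single row of logits for the whole input
sequence_logits = linear(pooler_output, weight, bias)

print(len(token_logits), len(token_logits[0]))  # number of tokens, number of labels
print(len(sequence_logits))                     # number of labels
```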

## 5. Fine-Tuning BERT for Sentiment Analysis

We will now fine-tune BERT on sentiment analysis data, i.e. data that consists of sentences and a label of 0 (negative), 1 (neutral) or 2 (positive).

In [13]:
import torch.nn as nn
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch.nn.functional as F
from datasets import load_dataset


We will use MTEB's Twitter sentiment analysis dataset, restricted to its first 2,000 training examples.

In [14]:
label_dic = {
    0: 'negative',
    1: 'neutral',
    2: 'positive'
}

train = load_dataset("mteb/tweet_sentiment_extraction", split="train[:2000]")
print(f'train size: {len(train)}')


train size: 2000


___
The classifier is simply a multinomial logistic regression (a single linear layer trained with cross-entropy loss) fitted on top of the **pooler_output** vector.

In [15]:
classifier = nn.Linear(768,3)

optim = torch.optim.Adam(classifier.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

train_loader = DataLoader(train, batch_size=32)
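
For intuition about the loss values reported during training, the following plain-Python sketch computes the softmax cross-entropy for the three logits of a single example, which is the quantity `CrossEntropyLoss` averages over a batch. The helper names and the logits are illustrative. With uninformative logits the loss equals ln(3) ≈ 1.0986, which is roughly the value to expect before the classifier has learned anything.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

# All three classes equally likely: loss is ln(3)
print(round(cross_entropy([0.0, 0.0, 0.0], label=2), 4))  # 1.0986
# Made-up logits that favor the correct class: loss is smaller
print(round(cross_entropy([0.2, -1.0, 2.5], label=2), 4))
```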

To make training faster, we will freeze the parameters from BERT itself. So, it will act as a static feature extractor, and only the logistic regression will be fitted.

In [16]:
for param in model.parameters():
    param.requires_grad = False

We now train the classifier: each batch of sentences is passed through BERT, the pooler_output vector is extracted, and the classifier is applied to it.

In [17]:
# Number of epochs
n_epochs = 10

for i in range(n_epochs):

    # For each batch in the training set
    for batch in tqdm(train_loader):

        # Get text and label
        texts = batch['text']
        y = batch['label']

        # Applying BERT to texts and getting pooler_output
        inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
        outputs = model(**inputs)
        cls_token = outputs.pooler_output

        # Applying the classifier to pooler_output
        y_pred = classifier(cls_token)

        # Finding loss
        loss = loss_fn(y_pred,y)

        # Updating weights through backprop
        optim.zero_grad()
        loss.backward()
        optim.step()
        
    print(f'epoch: [{i+1}/{n_epochs}] | last batch training loss: {loss.item()}')

100%|███████████████████████████████████████████| 63/63 [00:26<00:00,  2.37it/s]


epoch: [1/10] | last batch training loss: 1.0779420137405396


100%|███████████████████████████████████████████| 63/63 [00:25<00:00,  2.50it/s]


epoch: [2/10] | last batch training loss: 0.9721745848655701


100%|███████████████████████████████████████████| 63/63 [00:25<00:00,  2.49it/s]


epoch: [3/10] | last batch training loss: 0.9489535093307495


100%|███████████████████████████████████████████| 63/63 [00:24<00:00,  2.54it/s]


epoch: [4/10] | last batch training loss: 0.9208956360816956


100%|███████████████████████████████████████████| 63/63 [00:24<00:00,  2.59it/s]


epoch: [5/10] | last batch training loss: 0.8892520666122437


100%|███████████████████████████████████████████| 63/63 [00:25<00:00,  2.44it/s]


epoch: [6/10] | last batch training loss: 0.8570788502693176


100%|███████████████████████████████████████████| 63/63 [00:24<00:00,  2.61it/s]


epoch: [7/10] | last batch training loss: 0.8284878134727478


100%|███████████████████████████████████████████| 63/63 [00:24<00:00,  2.58it/s]


epoch: [8/10] | last batch training loss: 0.8041089177131653


100%|███████████████████████████████████████████| 63/63 [00:24<00:00,  2.61it/s]


epoch: [9/10] | last batch training loss: 0.7828696966171265


100%|███████████████████████████████████████████| 63/63 [00:24<00:00,  2.60it/s]

epoch: [10/10] | last batch training loss: 0.7639827132225037





Now, let's test the trained model on some made-up sentences.

In [23]:
test_sentences = [
    'im very sad at the whole situation',
    'i woke up feeling amazing',
    'theres going to be a full eclipse 2day',
    'i wish i could have a different life sometimes. it gets hard',
    'this is great news!',
    'I just got a raise :o'
]

In [24]:
for sentence in test_sentences:
    
    inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)
    outputs = model(**inputs)
    cls_token = outputs.pooler_output
    logits = classifier(cls_token)  # raw class scores (logits), not log-probabilities
    predicted_label = F.softmax(logits, dim=-1).argmax()
    print(f'{sentence:20} -> {label_dic[predicted_label.item()]}\n')

im very sad at the whole situation -> negative

i woke up feeling amazing -> positive

theres going to be a full eclipse 2day -> neutral

i wish i could have a different life sometimes. it gets hard -> negative

this is great news!  -> positive

I just got a raise :o -> positive



The examples above suggest that BERT has, at the very least, provided a good 'automatic feature engineering procedure' for the classification head.