***Introduction to BERT***

BERT stands for Bidirectional Encoder Representations from Transformers.

It is an auto-encoding language model that uses only the encoder stack of the Transformer architecture and relies on self-attention.

Consider the following sentence:

"I love my pet python"

We feed this sentence into BERT to get a contextual representation (a vector embedding) of every word in the sentence.

The encoder captures the context of each word using a multi-head self-attention mechanism, which relates each word to every other word in the sentence.
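To make "relates each word to every other word" concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. The 2-dimensional vectors are made-up toy values, not BERT weights, and the sketch skips the learned query, key, and value projections that the real model applies in each of its attention heads.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention with Q = K = V = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token in the sequence (itself included).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted average of all token vectors.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-d vectors standing in for three of the words (made-up values).
tokens = ["my", "pet", "python"]
vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]

contextual, attention_weights = self_attention(vectors)
for token, weights in zip(tokens, attention_weights):
    print(token, [round(w, 2) for w in weights])
```

Each printed row shows how strongly one word attends to every word in the sequence. Because every output vector mixes information from all positions, the representation of a word such as "python" depends on its neighbours.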

In [1]:
# imports

import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel


In [2]:
# Loading the model

model = BertModel.from_pretrained('bert-base-uncased')

In [3]:
# Getting the model parameters as a list of tuples

named_params = list(model.named_parameters())

In [4]:
print(f'The BERT model has {len(named_params)} different named parameters.\n')

print('==== Embedding layer ====\n')

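# The first 5 entries are the embedding parameters
for p in named_params[:5]:
    print('{:<55} {:>12}'.format(p[0], str(tuple(p[1].size()))))

# The next 16 entries are the first encoder layer (encoder.layer.0)
for p in named_params[5:21]:
    print('{:<55} {:>12}'.format(p[0], str(tuple(p[1].size()))))

print('\n==== Output layer ====\n')

# The last 2 entries are the pooler's dense weight and bias
for p in named_params[-2:]:
    print('{:<55} {:>12}'.format(p[0], str(tuple(p[1].size()))))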

The BERT model has 199 different named parameters.

==== Embedding layer ====

embeddings.word_embeddings.weight                       (30522, 768)
embeddings.position_embeddings.weight                     (512, 768)
embeddings.token_type_embeddings.weight                     (2, 768)
embeddings.LayerNorm.weight                                   (768,)
embeddings.LayerNorm.bias                                     (768,)
encoder.layer.0.attention.self.query.weight               (768, 768)
encoder.layer.0.attention.self.query.bias                     (768,)
encoder.layer.0.attention.self.key.weight                 (768, 768)
encoder.layer.0.attention.self.key.bias                       (768,)
encoder.layer.0.attention.self.value.weight               (768, 768)
encoder.layer.0.attention.self.value.bias                     (768,)
encoder.layer.0.attention.output.dense.weight             (768, 768)
encoder.layer.0.attention.output.dense.bias                   (768,)
encoder.layer.0.attentio

We can see that BERT has a vocabulary of 30,522 tokens that it can use for any NLP task, including the special [CLS] and [SEP] tokens.

The 768 shows that each of those tokens has a pre-trained, context-free embedding of 768 dimensions.
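As a quick sanity check on those shapes, the arithmetic below works out how many weights each embedding table holds and how the 199 named parameter tensors add up. It is a sketch that assumes all 12 encoder layers carry the same 16 tensors as the first one (the `named_params[5:21]` slice); the variable names are illustrative only.

```python
# Shapes copied from the printed parameter list.
vocab_size, hidden_size = 30522, 768
max_positions, token_types = 512, 2

# Each embedding table holds rows * columns learned numbers.
word_embedding_weights = vocab_size * hidden_size
position_embedding_weights = max_positions * hidden_size
token_type_embedding_weights = token_types * hidden_size

print(f"word embeddings:       {word_embedding_weights:,}")
print(f"position embeddings:   {position_embedding_weights:,}")
print(f"token type embeddings: {token_type_embedding_weights:,}")

# Named parameter tensors: 5 in the embedding block, 16 per encoder layer
# (assumed identical across 12 layers), and 2 in the pooler.
embedding_tensors, tensors_per_layer, num_layers, pooler_tensors = 5, 16, 12, 2
total_tensors = embedding_tensors + tensors_per_layer * num_layers + pooler_tensors
print(f"named parameter tensors: {total_tensors}")
```

The word embedding table alone accounts for 23,440,896 weights, and the tensor count comes to 5 + 16 × 12 + 2 = 199, matching the number reported by the model.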

In [5]:
# Loading a tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [6]:
# Encoding a simple sequence

tokenizer.encode('Sinan loves a beautiful day')

[101, 8254, 2319, 7459, 1037, 3376, 2154, 102]
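The encoded list has eight ids for a five-word sentence: the first and last ids are the special [CLS] and [SEP] tokens, and a word that is not in the vocabulary gets split into sub-word pieces. The sketch below imitates that behaviour with a greedy longest-match-first WordPiece split. The toy vocabulary, and in particular the split of "sinan" into "sin" and "##an", is an assumption for illustration and is not read from BERT's real vocabulary file.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

def tokenize(text, vocab):
    """Lowercase, split on whitespace, apply WordPiece, add [CLS] and [SEP]."""
    pieces = ["[CLS]"]
    for word in text.lower().split():
        pieces.extend(wordpiece(word, vocab))
    return pieces + ["[SEP]"]

# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = {"sin", "##an", "loves", "a", "beautiful", "day"}

pieces = tokenize("Sinan loves a beautiful day", toy_vocab)
print(pieces)       # ['[CLS]', 'sin', '##an', 'loves', 'a', 'beautiful', 'day', '[SEP]']
print(len(pieces))  # 8
```

Under this assumed vocabulary the sentence yields eight pieces, which is consistent with the eight ids returned by `tokenizer.encode`.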

In [7]:
# Run tokens through the model
# 1st: turn the list of token ids into a tensor of shape (8,)
# 2nd: unsqueeze a leading dimension to simulate a batch, giving shape (1, 8)

response = model(torch.tensor(tokenizer.encode('Sinan loves a beautiful day')).unsqueeze(0))

The tokenizer returns a list of token ids, which we pass to torch to create a tensor. We then unsqueeze it along dimension 0 to add a batch dimension before passing it to the model.

In [8]:
# Embedding for each token

response.last_hidden_state

tensor([[[-0.2327,  0.1515, -0.0448,  ..., -0.5192,  0.4195,  0.2948],
         [ 0.3051, -0.6614,  0.2500,  ..., -0.9809,  0.2551,  0.2400],
         [-0.3610, -0.8759,  0.4542,  ..., -1.1120,  0.1791,  0.0664],
         ...,
         [ 0.0689, -0.0364,  0.4940,  ..., -0.6558,  0.2227, -0.3868],
         [-0.2657, -0.4257,  0.0056,  ...,  0.1352,  0.3596, -0.4585],
         [ 0.6100,  0.0263, -0.2532,  ..., -0.0680, -0.3901, -0.3541]]],
       grad_fn=<NativeLayerNormBackward0>)

Each row is one token in our sequence, and the 768 values in that row form the token's contextual embedding, which encodes its meaning within the full sequence.

In [9]:
response

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.2327,  0.1515, -0.0448,  ..., -0.5192,  0.4195,  0.2948],
         [ 0.3051, -0.6614,  0.2500,  ..., -0.9809,  0.2551,  0.2400],
         [-0.3610, -0.8759,  0.4542,  ..., -1.1120,  0.1791,  0.0664],
         ...,
         [ 0.0689, -0.0364,  0.4940,  ..., -0.6558,  0.2227, -0.3868],
         [-0.2657, -0.4257,  0.0056,  ...,  0.1352,  0.3596, -0.4585],
         [ 0.6100,  0.0263, -0.2532,  ..., -0.0680, -0.3901, -0.3541]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.8777, -0.4542, -0.6287,  0.7511,  0.3151, -0.0913,  0.9175,  0.3766,
         -0.3059, -1.0000, -0.0577,  0.7535,  0.9913,  0.2113,  0.9418, -0.5328,
         -0.0568, -0.5698,  0.4090, -0.6096,  0.7876,  0.9995,  0.3670,  0.2453,
          0.4620,  0.9465, -0.6802,  0.9342,  0.9614,  0.7060, -0.5755,  0.2076,
         -0.9910, -0.1697, -0.8019, -0.9952,  0.3786, -0.7309, -0.0599, -0.0186,
         -0.8722,  0.3377,  0.99

**Pooler Output**

The pooler output is meant to represent the sequence as a whole.

In [10]:
response.pooler_output.shape

torch.Size([1, 768])

In [11]:
model.pooler

BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)

In [12]:
# Grab the [CLS] token's representation from the final (12th) encoder layer

CLS_embedding = response.last_hidden_state[:, 0, :].unsqueeze(0)

We take index 0 along the second dimension (the token dimension), which is the [CLS] token, and then add a leading dimension with unsqueeze.
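The shape bookkeeping in that cell can be followed with plain nested lists. This is a sketch with made-up numbers and a tiny hidden size of 4 instead of 768; `shape` is a hypothetical helper written for this illustration, not a torch function.

```python
def shape(nested):
    """Return the dimensions of a non-empty rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# Stand-in for last_hidden_state: (batch=1, tokens=3, hidden=4), made-up values.
last_hidden_state = [[
    [0.1, 0.2, 0.3, 0.4],   # token 0, the [CLS] position
    [0.5, 0.6, 0.7, 0.8],   # token 1
    [0.9, 1.0, 1.1, 1.2],   # token 2
]]

# Equivalent of [:, 0, :]: keep token 0 of every sequence in the batch.
cls_rows = [sequence[0] for sequence in last_hidden_state]

# Equivalent of .unsqueeze(0): wrap the result in one more leading dimension.
cls_embedding = [cls_rows]

print(shape(last_hidden_state))  # (1, 3, 4)
print(shape(cls_rows))           # (1, 4)
print(shape(cls_embedding))      # (1, 1, 4)
```

With the real hidden size, the same two steps take a (1, 8, 768) tensor to (1, 768) and then to (1, 1, 768).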

In [13]:
CLS_embedding

tensor([[[-2.3271e-01,  1.5150e-01, -4.4841e-02, -8.4486e-02, -2.5227e-01,
          -1.5574e-01,  4.2548e-01,  6.2580e-01, -1.6765e-01, -2.2078e-01,
          -4.8964e-02, -1.6687e-01,  3.7025e-01,  5.4881e-01,  5.2734e-02,
          -1.8367e-01, -1.0926e-01,  6.5831e-01,  4.5373e-01, -1.1532e-01,
           2.0391e-01, -4.1622e-01, -4.4383e-02, -1.0411e-01,  2.5143e-01,
          -3.3175e-01, -6.4523e-03,  2.4058e-01,  2.1243e-01, -7.7977e-02,
          -1.4318e-01,  2.5448e-01, -2.2074e-02,  9.2747e-02,  9.0562e-02,
          -1.2345e-01,  3.4437e-01, -2.0981e-01,  1.8461e-01,  4.8943e-01,
           6.5386e-02,  5.8490e-02,  2.9532e-01, -2.0740e-01,  4.4421e-02,
          -8.8770e-01, -2.2506e+00,  6.8181e-03,  5.4260e-02, -3.1604e-01,
           3.5553e-01,  1.1498e-02,  3.7963e-01,  2.6510e-01,  1.2690e-01,
           5.0096e-01, -6.3212e-01,  7.9531e-01,  1.0351e-01,  4.1282e-01,
           3.9759e-01, -2.2472e-01, -4.5974e-03,  5.6817e-02, -1.9067e-01,
           4.2596e-01, -3

In [14]:
CLS_embedding.shape

torch.Size([1, 1, 768])

One batch, one token, and 768 dimensions in the final representation.

In [15]:
model.pooler(CLS_embedding).shape

torch.Size([1, 768])

This is the vector representation of the entire sequence: one batch containing a single 768-dimensional vector.

The model's pooler is a single dense (feed-forward) layer with a hyperbolic tangent (tanh) activation function.
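What the pooler computes can be sketched in a few lines of plain Python: a dense layer followed by tanh, applied to the [CLS] vector. The 3-dimensional vector, weights, and bias below are made-up toy values standing in for the real 768-dimensional ones.

```python
import math

def pooler(cls_vector, weight, bias):
    """Dense layer followed by tanh: out[i] = tanh(sum_j weight[i][j] * x[j] + bias[i])."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]

# Toy 3-dimensional stand-ins for the 768-dimensional vectors (made-up values).
cls_vector = [-0.23, 0.15, 2.5]
weight = [
    [0.5, -1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, -4.0],
]
bias = [0.1, 0.0, 0.0]

pooled = pooler(cls_vector, weight, bias)
print([round(value, 4) for value in pooled])
```

Because tanh squashes every value into the range from -1 to 1, the pooled vector is bounded, which is why entries such as -1.0000 and 0.9995 appear in `pooler_output`.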

In [16]:
# Running the embedding for CLS through the pooler gives the same output
# as the pooler output

(model.pooler(CLS_embedding) == response.pooler_output).all()

tensor(True)
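Exact equality holds here because the same operations run on the same input. In general, floating-point results that are mathematically equal can differ in their last bits, so a tolerance-based comparison is the safer check. The sketch below shows the idea in plain Python with a hypothetical `all_close` helper; for tensors, `torch.allclose` plays the same role.

```python
import math

def all_close(left, right, rel_tol=1e-5, abs_tol=1e-8):
    """True when every pair of values agrees within the given tolerances."""
    return len(left) == len(right) and all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(left, right)
    )

# Two mathematically equal results that differ in the last bits.
direct = [0.3, -0.8777]
recomputed = [0.1 + 0.2, -0.8777]

print(direct == recomputed)           # False
print(all_close(direct, recomputed))  # True
```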