***BERT's Embedding Layer***

Token embeddings:
- Represents context-less meaning of each token
- A lookup of 30522 possible vectors (for BERT-base)
- This is learnable during training

Context-less tokens

Segment embeddings:
- Distinguishes between multiple inputs (for Q/A for example)
- A lookup of 2 possible vectors (one for sentence A and one for sentence B)
- This is learnable during training

Whether the token comes from sentence A or B

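Position embeddings:
- Used to represent the token's position in the sequence
- A lookup of 512 possible vectors (one per position, for BERT-base)
- This is learnable during training (unlike the fixed sinusoidal encodings of the original Transformer)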

The position of the token within the sequence

In [17]:
# imports

import torch
from transformers import BertModel, BertTokenizer

In [3]:
# Loading model

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [4]:
# Looking at the model embeddings

model.embeddings

BertEmbeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
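The shapes printed above are enough to work out how many learnable parameters sit in the embedding layer. The following is a small arithmetic sketch; the variable names are made up for illustration, and only the four numbers come from the printed module.

```python
# Shapes copied from the printed BertEmbeddings module above.
vocab_size, max_positions, num_segments, hidden = 30522, 512, 2, 768

param_counts = {
    "word_embeddings": vocab_size * hidden,
    "position_embeddings": max_positions * hidden,
    "token_type_embeddings": num_segments * hidden,
    # LayerNorm with elementwise_affine=True has one weight and one bias vector
    "LayerNorm": 2 * hidden,
}
total_params = sum(param_counts.values())

for name, count in param_counts.items():
    print(f"{name:>22}: {count:>10,}")
print(f"{'total':>22}: {total_params:>10,}")
```

Almost all of these parameters belong to the token lookup table; the position and segment tables are tiny by comparison.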

In [5]:
example_phrase = 'I am Sinan'

In [6]:
# This time we convert the phrase directly to PyTorch tensors

tokenizer.encode(example_phrase, return_tensors='pt')

tensor([[ 101, 1045, 2572, 8254, 2319,  102]])
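Three words produce six IDs because the tokenizer adds the special tokens and splits a word it does not know into smaller pieces. BERT's WordPiece tokenizer does this with a greedy longest-match-first search; the sketch below imitates that idea with a hypothetical toy vocabulary (not the real bert-base-uncased vocabulary) chosen so that the result has six tokens like the output above. The actual pieces can be inspected with `tokenizer.convert_ids_to_tokens`.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"i", "am", "sin", "##an", "##a", "##n"}

pieces = [p for w in "i am sinan".split() for p in wordpiece_split(w, toy_vocab)]
tokens = ["[CLS]"] + pieces + ["[SEP]"]
print(tokens)
```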

In [8]:
# Context-less embedding of each word in our sentence

model.embeddings.word_embeddings(
    tokenizer.encode(example_phrase, return_tensors='pt'))

tensor([[[ 0.0136, -0.0265, -0.0235,  ...,  0.0087,  0.0071,  0.0151],
         [-0.0211,  0.0059, -0.0179,  ...,  0.0163,  0.0122,  0.0073],
         [-0.0437, -0.0150,  0.0029,  ..., -0.0282,  0.0474, -0.0448],
         [-0.0022, -0.0876,  0.0143,  ...,  0.0232, -0.0024, -0.0213],
         [-0.0614, -0.0044, -0.0755,  ..., -0.0522, -0.0310, -0.0248],
         [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015]]],
       grad_fn=<EmbeddingBackward0>)

We have 6 rows, one for each token, and 768 columns, one for each embedding dimension.
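An embedding lookup is nothing more than selecting rows of a weight matrix by ID. The sketch below uses a small, randomly initialized `nn.Embedding` (toy sizes, not BERT's weights) to show that calling the layer and indexing its `weight` give the same tensor, with shape (batch, tokens, embedding dimension).

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)  # toy sizes
token_ids = torch.tensor([[1, 5, 5, 9]])  # a batch of 1 sequence with 4 tokens

looked_up = toy_embedding(token_ids)        # forward pass of the layer
indexed = toy_embedding.weight[token_ids]   # plain row indexing

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```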

In [15]:
# Note that the first and last rows match those of the previous output,
# since without context the CLS and SEP embeddings are the same in every sentence

model.embeddings.word_embeddings(
    tokenizer.encode('She is Zoe', return_tensors='pt')
)

tensor([[[ 0.0136, -0.0265, -0.0235,  ...,  0.0087,  0.0071,  0.0151],
         [-0.0255, -0.0532,  0.0113,  ...,  0.0024, -0.0210,  0.0090],
         [-0.0360, -0.0246, -0.0257,  ...,  0.0034, -0.0018,  0.0269],
         [-0.0673,  0.0302,  0.0147,  ..., -0.0061,  0.0402, -0.0234],
         [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015]]],
       grad_fn=<EmbeddingBackward0>)

Before any context is added, the CLS and SEP tokens get the same embeddings in every sentence: the first and last rows here match those of the previous example.
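Rather than comparing printed rows by eye, the same property can be checked programmatically. This is a minimal sketch on a randomly initialized toy embedding table; the middle IDs are arbitrary, and 101 and 102 are reused from the output above to stand in for CLS and SEP.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_embedding = nn.Embedding(num_embeddings=200, embedding_dim=8)  # toy sizes

# Two sequences of different length that share only their first and last IDs
seq_a = torch.tensor([101, 7, 8, 9, 10, 102])
seq_b = torch.tensor([101, 20, 21, 22, 102])

emb_a, emb_b = toy_embedding(seq_a), toy_embedding(seq_b)

same_cls = torch.equal(emb_a[0], emb_b[0])        # same ID, same row
same_sep = torch.equal(emb_a[-1], emb_b[-1])
cls_equals_sep = torch.equal(emb_a[0], emb_a[-1])  # different IDs, different rows
print(same_cls, same_sep, cls_equals_sep)
```

The lookup depends only on the token ID, never on the neighbouring tokens, which is what "context-less" means here.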

In [18]:
# Now we look at the position embeddings
# Our example phrase "I am Sinan" has 6 elements:
# CLS, SEP, and four tokens, since "Sinan" is split into two word pieces

model.embeddings.position_embeddings(
    torch.LongTensor(range(6))
)

tensor([[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
          6.8312e-04,  1.5441e-02],
        [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
          2.9753e-02, -5.3247e-03],
        [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
          1.8741e-02, -7.3140e-03],
        [-4.1949e-03, -1.1852e-02, -2.1180e-02,  ...,  2.2455e-02,
          5.2826e-03, -1.9723e-03],
        [-5.6087e-03, -1.0445e-02, -7.2288e-03,  ...,  2.0837e-02,
          3.5402e-03,  4.7708e-03],
        [-3.0871e-03, -1.8956e-02, -1.8930e-02,  ...,  7.4045e-03,
          2.0183e-02,  3.4077e-03]], grad_fn=<EmbeddingBackward0>)
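Because positions are looked up in a table of fixed size, a sequence can be no longer than the table. The sketch below illustrates this with a toy position table of 8 rows (BERT-base has 512); the names are hypothetical, and the exact exception type may vary across PyTorch versions and devices, so both common ones are caught.

```python
import torch
from torch import nn

max_positions, hidden = 8, 4  # toy sizes; BERT-base uses 512 and 768
toy_positions = nn.Embedding(max_positions, hidden)

position_ids = torch.arange(6)  # positions 0..5 for a 6-token sequence
position_vectors = toy_positions(position_ids)
print(position_vectors.shape)

# A sequence longer than the table asks for a row that does not exist
too_long = torch.arange(max_positions + 1)
overflow_error = None
try:
    toy_positions(too_long)
except (IndexError, RuntimeError) as err:
    overflow_error = err
print(type(overflow_error).__name__)
```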

In [25]:
# Lastly we have the token type embeddings
# The sequence the tokens are a part of

model.embeddings.token_type_embeddings(
    torch.zeros(6, dtype=torch.long)
)

tensor([[ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086]],
       grad_fn=<EmbeddingBackward0>)
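All six rows are identical because a single sentence uses segment 0 for every token. With a sentence pair laid out as `[CLS] A [SEP] B [SEP]`, the second sentence and its closing SEP get segment 1. The helper below is a hypothetical sketch of that layout (the function name and the pair's IDs are made up), not the tokenizer's own code.

```python
def build_token_type_ids(ids_a, ids_b=None):
    """Segment IDs for the layout [CLS] A [SEP] (B [SEP])."""
    type_ids = [0] * (len(ids_a) + 2)       # [CLS] + A + [SEP]
    if ids_b is not None:
        type_ids += [1] * (len(ids_b) + 1)  # B + [SEP]
    return type_ids


# The four non-special IDs of "I am Sinan" from the output above
single = build_token_type_ids([1045, 2572, 8254, 2319])
# Hypothetical IDs for a two-sentence input
pair = build_token_type_ids([11, 12, 13], [21, 22])

print(single)
print(pair)
```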

In [26]:
# Sum the three embeddings, then apply the layer normalization (LayerNorm) layer

input_ids = tokenizer.encode(example_phrase, return_tensors='pt')

model.embeddings.LayerNorm(
    model.embeddings.word_embeddings(input_ids)
    + model.embeddings.position_embeddings(torch.LongTensor(range(6)))
    + model.embeddings.token_type_embeddings(torch.zeros(6, dtype=torch.long))
)

tensor([[[ 1.6855e-01, -2.8577e-01, -3.2613e-01,  ..., -2.7571e-02,
           3.8253e-02,  1.6400e-01],
         [-3.4025e-04,  5.3974e-01, -2.8805e-01,  ...,  7.5731e-01,
           8.9008e-01,  1.6575e-01],
         [-6.3496e-01,  1.9748e-01,  2.5116e-01,  ..., -4.0819e-02,
           1.3468e+00, -6.9357e-01],
         [ 2.8197e-01, -1.0037e+00,  3.5063e-01,  ...,  8.5378e-01,
           3.9389e-01, -8.4527e-02],
         [-7.3509e-01,  3.3429e-01, -8.3037e-01,  ..., -2.1545e-01,
          -6.6517e-02, -2.6881e-02],
         [-3.2507e-01, -3.1879e-01, -1.1632e-01,  ..., -3.9602e-01,
           4.1120e-01, -7.7552e-02]]], grad_fn=<NativeLayerNormBackward0>)
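The values are now much larger than in the raw embeddings because LayerNorm rescales each token's vector to zero mean and unit variance before applying its learned weight and bias. The sketch below reproduces that computation by hand on random toy data (not BERT's weights) and compares it with `nn.LayerNorm`.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden = 8  # toy size; BERT-base uses 768
layer_norm = nn.LayerNorm(hidden, eps=1e-12)
x = torch.randn(6, hidden)  # 6 token vectors

# Normalize each row over the hidden dimension (biased variance, as LayerNorm does)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)
manual = manual * layer_norm.weight + layer_norm.bias

matches = torch.allclose(manual, layer_norm(x), atol=1e-5)
print(matches)
```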

In [28]:
# Passing the token IDs straight to BERT's embedding module gives the same matrix

model.embeddings(
    tokenizer.encode(example_phrase, return_tensors='pt')
)

tensor([[[ 1.6855e-01, -2.8577e-01, -3.2613e-01,  ..., -2.7571e-02,
           3.8253e-02,  1.6400e-01],
         [-3.4026e-04,  5.3974e-01, -2.8805e-01,  ...,  7.5731e-01,
           8.9008e-01,  1.6575e-01],
         [-6.3496e-01,  1.9748e-01,  2.5116e-01,  ..., -4.0819e-02,
           1.3468e+00, -6.9357e-01],
         [ 2.8197e-01, -1.0037e+00,  3.5063e-01,  ...,  8.5378e-01,
           3.9389e-01, -8.4527e-02],
         [-7.3509e-01,  3.3429e-01, -8.3037e-01,  ..., -2.1545e-01,
          -6.6517e-02, -2.6881e-02],
         [-3.2507e-01, -3.1879e-01, -1.1632e-01,  ..., -3.9602e-01,
           4.1120e-01, -7.7552e-02]]], grad_fn=<NativeLayerNormBackward0>)

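This shows that, for the initial embedding, BERT simply adds these three representations and passes the sum through a LayerNorm layer. The module also ends with dropout, which has no effect here because from_pretrained returns the model in evaluation mode.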
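To make the structure explicit, here is a minimal sketch of an embedding module built the same way, with toy sizes and random weights. `ToyBertEmbeddings` is a hypothetical class written for illustration, not the library's implementation; it only mirrors the submodule names printed earlier.

```python
import torch
from torch import nn


class ToyBertEmbeddings(nn.Module):
    def __init__(self, vocab_size=50, max_positions=16, num_segments=2, hidden=8):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden, padding_idx=0)
        self.position_embeddings = nn.Embedding(max_positions, hidden)
        self.token_type_embeddings = nn.Embedding(num_segments, hidden)
        self.LayerNorm = nn.LayerNorm(hidden, eps=1e-12)
        self.dropout = nn.Dropout(p=0.1)

    def forward(self, input_ids, token_type_ids=None):
        seq_len = input_ids.shape[1]
        position_ids = torch.arange(seq_len, device=input_ids.device)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)  # single sentence
        summed = (
            self.word_embeddings(input_ids)
            + self.position_embeddings(position_ids)
            + self.token_type_embeddings(token_type_ids)
        )
        return self.dropout(self.LayerNorm(summed))


torch.manual_seed(0)
toy = ToyBertEmbeddings().eval()  # eval mode turns dropout off
input_ids = torch.tensor([[1, 7, 8, 9, 10, 2]])  # arbitrary toy IDs

manual = toy.LayerNorm(
    toy.word_embeddings(input_ids)
    + toy.position_embeddings(torch.arange(6))
    + toy.token_type_embeddings(torch.zeros(6, dtype=torch.long))
)
same_in_eval = torch.allclose(toy(input_ids), manual)

toy.train()  # dropout now zeroes some values and rescales the rest
same_in_train = torch.allclose(toy(input_ids), manual)
print(same_in_eval, same_in_train)
```

Using `torch.allclose` instead of comparing printed tensors also tolerates tiny floating-point differences, such as the last digit that differs between the two outputs above.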