***Tokenize + Embedding layer***

Consider the following sentence:

"Another beautiful day"

To tokenize this, we split it into a list of tokens drawn from BERT's vocabulary of just over 30,000 tokens.

We also add two special tokens CLS at the beginning and SEP at the end.

CLS is meant to represent the entire sequence and SEP is meant to represent separation between sentences.

["CLS", "another", "beautiful", "day", "SEP"]

**What about new words**

In "Sinan loves a beautiful day", BERT does not recognize "Sinan" and it becomes:

["CLS", "sin", "##an", "loves", "a", "beautiful", "day", "SEP"]

Here "##" marks a subword that continues the preceding token rather than starting a new word.

BERT's tokenizer is great at handling tokens that are OOV (Out of Vocabulary) by breaking them up into smaller chunks of known tokens.
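The splitting of an unknown word into known pieces can be sketched in plain Python. This is a minimal illustration of greedy longest-match-first subword splitting, assuming a tiny hypothetical vocabulary (`toy_vocab`) rather than BERT's real one; `wordpiece_tokenize` is an illustrative helper, not a library function.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no known piece covers this position
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not BERT's real one.
toy_vocab = {"sin", "##an", "loves", "a", "beautiful", "day"}

sentence = "Sinan loves a beautiful day"
tokens = ["[CLS]"]
for word in sentence.lower().split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
tokens.append("[SEP]")

print(tokens)
# ['[CLS]', 'sin', '##an', 'loves', 'a', 'beautiful', 'day', '[SEP]']
```

A word falls back to `[UNK]` only when some position cannot be covered by any piece in the vocabulary.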

**Freezing the model**

If we fine-tune BERT on a specific domain with many OOV words, we need to be careful with freezing: frozen weights are not updated, so BERT cannot learn how those tokens are used in the domain's context.
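As a rough sketch of what freezing means in PyTorch, the example below disables gradients for part of a model. `ToyEncoder` is a hypothetical stand-in for a pretrained encoder with a task head, not BERT itself; the same `requires_grad = False` pattern is what would be applied to the parameters one chooses to freeze.

```python
import torch
from torch import nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained encoder plus a task head."""

    def __init__(self, vocab_size=100, hidden=8, num_labels=2):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids):
        # Average the token embeddings, then classify.
        return self.classifier(self.embeddings(input_ids).mean(dim=1))


toy = ToyEncoder()

# Freeze the embedding layer: its weights receive no gradient updates.
for param in toy.embeddings.parameters():
    param.requires_grad = False

trainable = [name for name, p in toy.named_parameters() if p.requires_grad]
print(trainable)  # ['classifier.weight', 'classifier.bias']

# After a backward pass, the frozen layer has no gradient at all.
loss = toy(torch.tensor([[1, 2, 3]])).sum()
loss.backward()
print(toy.embeddings.weight.grad)  # None
```

With the embeddings frozen, the representations of domain-specific subword pieces stay exactly as they were after pretraining, which is the risk described above.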

**Note on tokenization**

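BERT has a maximum sequence length of 512 tokens. This limit exists for the sake of efficiency, since the cost of self-attention grows quadratically with sequence length.

Shorter sequences are typically padded to a common length (at most 512), and a sequence longer than 512 tokens must be truncated or the model may error out.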
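Padding and truncation to a fixed length can be sketched in plain Python. The helper below is hypothetical (`pad_or_truncate` is not a library function); it assumes the [PAD] token has ID 0 and [SEP] has ID 102, as in `bert-base-uncased`, and it keeps a trailing [SEP] when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0, sep_id=102):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids)
    if len(ids) > max_length:
        # Keep the first max_length - 1 IDs and end with [SEP].
        ids = ids[: max_length - 1] + [sep_id]
    attention_mask = [1] * len(ids)  # 1 = real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding
    attention_mask += [0] * padding  # 0 = padding, ignored by attention
    return ids, attention_mask


short_ids = [101, 1037, 3722, 6251, 102]  # "a simple sentence"
padded_ids, padded_mask = pad_or_truncate(short_ids, max_length=8)
print(padded_ids)   # [101, 1037, 3722, 6251, 102, 0, 0, 0]
print(padded_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

long_ids = [101] + [1037] * 600 + [102]  # too long for BERT
truncated_ids, truncated_mask = pad_or_truncate(long_ids, max_length=512)
print(len(truncated_ids), truncated_ids[-1])  # 512 102
```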

**Uncased vs. Cased tokenization**

- Uncased: Removes accents and lowercases the input
- Cased: Does nothing to the input

Uncased tokenization works well in most situations, since case rarely contributes to context.

Cased tokenization works well when case carries information, as in Named Entity Recognition.

Note that this has little to do with the BERT architecture; it is a property of the tokenization.
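The uncased preprocessing step can be approximated with the standard library. This is a sketch of the idea (lowercase, then drop combining accent marks), assuming Unicode NFD decomposition; `uncased_normalize` is an illustrative name, not part of any tokenizer API.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase the text and strip accents, as an uncased tokenizer would."""
    text = text.lower()
    # NFD splits "é" into "e" plus a combining accent (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("Café in Zürich"))  # cafe in zurich

# Case can matter for named entities: both collapse to the same string.
print(uncased_normalize("Apple") == uncased_normalize("apple"))  # True
```

The second print shows why a cased tokenizer can help with Named Entity Recognition: after uncased normalization, the company name and the fruit are indistinguishable at the input level.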

**Words with context**

Consider the following sentences:

"I love my pet Python"

"I love coding in Python"

The token "Python" will end up with a different vector representation in each sentence because of the surrounding words.

In [1]:
# imports

import torch
from transformers import BertModel, BertTokenizer
from sklearn.metrics.pairwise import cosine_similarity



In [28]:
# Loading the BERT model

model = BertModel.from_pretrained('bert-base-uncased')

In [3]:
# Loading tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [4]:
# Checking the vocabulary length

print(f"The length of BERT's vocabulary is: {len(tokenizer.vocab)}")

The length of BERT's vocabulary is: 30522


In [5]:
# Tokenizing a simple sequence

text = "A simple sentence"
tokens = tokenizer.encode(text)

tokens

[101, 1037, 3722, 6251, 102]

We can see the CLS (101) and SEP (102) tokens, plus one token for each word, meaning every word was already in BERT's vocabulary.

In [6]:
# Checking a token not in BERT's vocabulary

text1 = "Lochlainn is not real name"
tokens1 = tokenizer.encode(text1)

tokens1

[101, 14941, 15987, 2078, 2003, 2025, 2613, 2171, 102]

The name "Lochlainn" is split into three subword tokens (14941, 15987, 2078), which shows it is not in BERT's vocabulary as a single token.

In [7]:
# Reconstructing the original sentence

tokenizer.decode(tokens)

'[CLS] a simple sentence [SEP]'

In [9]:
# Looking at a more complex sequence

text2 = "My friend told me about this class and I love it so far! She was right."
tokens2 = tokenizer.encode(text2)

print(tokens2)

[101, 2026, 2767, 2409, 2033, 2055, 2023, 2465, 1998, 1045, 2293, 2009, 2061, 2521, 999, 2016, 2001, 2157, 1012, 102]


In [11]:
# Looking at each token in more detail
# convert_ids_to_tokens maps an ID back to its vocabulary entry

print(f'Text: {text2}. Num tokens: {len(tokens2)}')

for t in tokens2:
    print(f'Token: {t}, subword: {tokenizer.convert_ids_to_tokens(t)}')

Text: My friend told me about this class and I love it so far! She was right.. Num tokens: 20
Token: 101, subword: [CLS]
Token: 2026, subword: my
Token: 2767, subword: friend
Token: 2409, subword: told
Token: 2033, subword: me
Token: 2055, subword: about
Token: 2023, subword: this
Token: 2465, subword: class
Token: 1998, subword: and
Token: 1045, subword: i
Token: 2293, subword: love
Token: 2009, subword: it
Token: 2061, subword: so
Token: 2521, subword: far
Token: 999, subword: !
Token: 2016, subword: she
Token: 2001, subword: was
Token: 2157, subword: right
Token: 1012, subword: .
Token: 102, subword: [SEP]


In [12]:
# The uncased vocabulary is lowercase, so check the lowercased form:
# "sinan" is not in the vocabulary

'sinan' in tokenizer.vocab

False

In [14]:
# Looking at tokenization with OOV

text_unknown_words = "Sinan loves a beautiful day"
tokens_unknown = tokenizer.encode(text_unknown_words)

tokens_unknown

[101, 8254, 2319, 7459, 1037, 3376, 2154, 102]

In [15]:
# Looking at each individual token in the sequence
# convert_ids_to_tokens maps an ID back to its vocabulary entry

for t in tokens_unknown:
    print(f'Token: {t}, word/subword: {tokenizer.convert_ids_to_tokens(t)}')

Token: 101, word/subword: [CLS]
Token: 8254, word/subword: sin
Token: 2319, word/subword: ##an
Token: 7459, word/subword: loves
Token: 1037, word/subword: a
Token: 3376, word/subword: beautiful
Token: 2154, word/subword: day
Token: 102, word/subword: [SEP]


In [16]:
# Using encode_plus
# Returns the token ids, the token type ids and an attention mask
# The attention mask is a sequence of 0s and 1s: 1 if the token should be
# taken into account in attention computations and 0 otherwise (e.g. padding)

text = 'My friend told me about this class and I love it so far! She was right.'
tokens = tokenizer.encode_plus(text)

tokens

{'input_ids': [101, 2026, 2767, 2409, 2033, 2055, 2023, 2465, 1998, 1045, 2293, 2009, 2061, 2521, 999, 2016, 2001, 2157, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [17]:
# Calling the tokenizer directly is the same as using encode_plus

tokenizer(text)

{'input_ids': [101, 2026, 2767, 2409, 2033, 2055, 2023, 2465, 1998, 1045, 2293, 2009, 2061, 2521, 999, 2016, 2001, 2157, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The token type ids are all 0, which shows that every token in the sequence comes from the first (and only) sentence.
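Token type ids become informative when two sentences are encoded together. The sketch below builds the three fields by hand for a sentence pair, assuming BERT's pair layout `[CLS] A [SEP] B [SEP]` and the IDs 101 and 102 seen above; `build_inputs` is a hypothetical helper, not the tokenizer's own method.

```python
CLS_ID, SEP_ID = 101, 102


def build_inputs(ids_a, ids_b=None):
    """Assemble input_ids, token_type_ids and attention_mask by hand."""
    input_ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # first sentence -> 0
    if ids_b is not None:
        second = list(ids_b) + [SEP_ID]
        input_ids += second
        token_type_ids += [1] * len(second)  # second sentence -> 1
    attention_mask = [1] * len(input_ids)  # no padding here
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pet = [1045, 2293, 2026, 9004, 18750]      # "i love my pet python"
coding = [1045, 2293, 16861, 1999, 18750]  # "i love coding in python"

pair = build_inputs(pet, coding)
print(pair["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```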

In [18]:
# Python example

text0 = 'I love my pet Python'
text1 = 'I love coding in Python'

python_pet = tokenizer.encode(text0)
python_coding = tokenizer.encode(text1)

In [19]:
python_pet

[101, 1045, 2293, 2026, 9004, 18750, 102]

In [20]:
python_coding

[101, 1045, 2293, 16861, 1999, 18750, 102]

We can see that "Python" maps to the same token ID (18750) in both sequences. Next, we pass each sequence as a tensor through the BERT model, which adds context to the vector representation of every token, and then compare the resulting "Python" vectors using cosine similarity.

In [33]:
# Adding context to pet sequence
# [0] is the last hidden state; position 5 is the "python" token

python_pet_embedding = model(torch.tensor(python_pet).unsqueeze(0))\
[0][:, 5, :].detach().numpy()

In [32]:
# Adding context to coding sequence
# [0] is the last hidden state; position 5 is the "python" token

python_coding_embedding = model(torch.tensor(python_coding).unsqueeze(0))\
[0][:, 5, :].detach().numpy()

In [37]:
# Now we will compare the word snake to each of these two vector
# representations of the word Python
# And the same with the word programming

snake_embedding = model(torch.tensor(tokenizer.encode('snake')).unsqueeze(0))\
[0][:, 1, :].detach().numpy()

programming_embedding = model(torch.tensor(tokenizer.encode('programming'))
                              .unsqueeze(0))[0][:, 1, :].detach().numpy()

In [38]:
# Similarity between the word Python in a sentence about coding and snake

cosine_similarity(python_coding_embedding, snake_embedding)

array([[0.5843479]], dtype=float32)

In [40]:
# Similarity between the word Python in a sentence about pets and snake

cosine_similarity(python_pet_embedding, snake_embedding)

array([[0.6928655]], dtype=float32)

In [41]:
# Similarity between programming and python from coding sentence

cosine_similarity(python_coding_embedding, programming_embedding)

array([[0.5614743]], dtype=float32)

In [42]:
# Similarity between python from pet sequence and programming

cosine_similarity(python_pet_embedding, programming_embedding)

array([[0.49864346]], dtype=float32)
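The comparisons above rely on cosine similarity, which measures the angle between two vectors and ignores their length. A minimal pure-Python sketch for two 1-D vectors follows; `cosine_similarity_1d` is an illustrative name, and the toy vectors are made up for the example, not taken from BERT.

```python
import math


def cosine_similarity_1d(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity_1d([3.0, 4.0], [3.0, 4.0])
orthogonal = cosine_similarity_1d([1.0, 0.0], [0.0, 1.0])
in_between = cosine_similarity_1d([1.0, 0.0], [1.0, 1.0])

print(same_direction)  # 1.0
print(orthogonal)      # 0.0
print(round(in_between, 4))  # 0.7071
```

Values closer to 1 mean the two vectors point in more similar directions, which is how the scores in the cells above should be read.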