## Subword Tokenization

Naturally, word tokenization can result in a massive vocabulary, and an even larger one if the text contains many unique or rare words (think specialized medical or technical terms). Suppose there are about 1 million unique English words and each word vector has 1,000 dimensions. The input layer of a neural network would then need a weight matrix with 1 million x 1,000 = 1 billion weights.
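
To make the arithmetic concrete, here is a small back-of-the-envelope sketch. The vocabulary size and vector dimension are the figures from the paragraph above; the 4 bytes per weight is an assumption (32-bit floats), not something stated in the text.

```python
# Rough size of a word-level embedding matrix.
vocab_size = 1_000_000   # unique words (figure from the text)
embedding_dim = 1_000    # dimensions per word vector (figure from the text)
bytes_per_weight = 4     # assumption: weights stored as 32-bit floats

num_weights = vocab_size * embedding_dim
size_gb = num_weights * bytes_per_weight / 1e9

print(f"{num_weights:,} weights")      # 1,000,000,000 weights
print(f"{size_gb:.1f} GB as float32")  # 4.0 GB as float32
```

Under these assumptions, the embedding matrix alone would take about 4 GB before any other layer is counted.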

One way to address this problem is to limit the vocabulary to the most common words in the corpus (e.g., keep the 100,000 most frequent words) and map every other word to a shared 'unknown' token. Subword tokenization takes a different approach: it reduces the number of entries we need to store by keeping common subword pieces, which can then be combined to represent rarer or more specialized words.
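
The capped-vocabulary idea can be sketched in a few lines of plain Python. The helper names `build_vocab` and `words_to_ids`, the `[UNK]` string, and the tiny corpus are all illustrative assumptions, not part of any library.

```python
from collections import Counter

def build_vocab(corpus_words, max_size, unk_token="[UNK]"):
    """Keep the max_size most frequent words; id 0 is reserved for unknowns."""
    counts = Counter(corpus_words)
    vocab = {unk_token: 0}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def words_to_ids(words, vocab, unk_token="[UNK]"):
    """Map each word to its id, falling back to the shared unknown id."""
    return [vocab.get(word, vocab[unk_token]) for word in words]

corpus = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(corpus, max_size=2)
ids = words_to_ids("the cat chased the dog".split(), vocab)

print(vocab)  # {'[UNK]': 0, 'the': 1, 'cat': 2}
print(ids)    # [1, 2, 0, 1, 0]
```

Both "chased" and "dog" collapse to the same id, so the model can no longer tell them apart. That loss of information is what subword tokenization tries to avoid.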

In [5]:
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
text = 'i love cryptography, mathematics, and cybersecurity.'
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['i', 'love', 'crypt', '##ography', ',', 'mathematics', ',', 'and', 'cyber', '##se', '##cu', '##rity', '.']
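
The `##` prefix in the output above marks a piece that continues the previous one. As a rough illustration of how such a split can come about, the sketch below applies greedy longest-match-first splitting, the general strategy WordPiece-style tokenizers use at inference time. The function `split_into_subwords` and the toy vocabulary are assumptions chosen to reproduce the pieces shown above; the real BERT vocabulary has about 30,000 entries and the real tokenizer does extra normalization.

```python
def split_into_subwords(word, vocab, unk_token="[UNK]"):
    """Greedily take the longest vocabulary piece at each position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry '##'
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, just large enough for the examples.
toy_vocab = {"crypt", "##ography", "cyber", "##se", "##cu", "##rity", "love"}

print(split_into_subwords("cryptography", toy_vocab))   # ['crypt', '##ography']
print(split_into_subwords("cybersecurity", toy_vocab))  # ['cyber', '##se', '##cu', '##rity']
print(split_into_subwords("zebra", toy_vocab))          # ['[UNK]']
```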


In [8]:
from transformers import AutoTokenizer, GPT2Model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

In [9]:
text = 'i love cryptography, mathematics, and cybersecurity.'
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['i', 'Ġlove', 'Ġcryptography', ',', 'Ġmathematics', ',', 'Ġand', 'Ġcybersecurity', '.']
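
GPT-2 uses a different convention from BERT: instead of marking continuation pieces, its byte-level BPE tokenizer marks a leading space with the character `Ġ`. The sketch below copies the tokens printed above and shows that joining them and mapping `Ġ` back to a space recovers the input. This shortcut only holds for plain ASCII text like this example; other bytes are remapped differently.

```python
# Tokens copied from the GPT-2 output above.
gpt2_tokens = ['i', 'Ġlove', 'Ġcryptography', ',', 'Ġmathematics', ',',
               'Ġand', 'Ġcybersecurity', '.']

# Join the pieces, then turn each 'Ġ' back into the space it stands for.
reconstructed = "".join(gpt2_tokens).replace("Ġ", " ")

print(reconstructed)  # i love cryptography, mathematics, and cybersecurity.
```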


## Token, Vector, and Embedding

To get to a point where your model can understand text, you first have to tokenize it, vectorize it and create embeddings from these vectors.

- Tokenization: This is the process of dividing the original text into individual pieces called tokens. Each token is assigned a unique id to represent it as a number.
- Vectorization: Each unique id is then mapped to an n-dimensional vector, which is randomly initialized before training.
- Embedding: To give tokens meaning, the model must be trained on them. This allows the model to learn the meanings of words and how they relate to other words. To achieve this, the word vectors are “embedded” into an embedding space. As a result, similar words should have similar vectors after training.
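
The three steps above can be sketched end to end in plain Python. Everything here is a toy assumption: a whitespace tokenizer, a 4-dimensional vector size, and a fixed random seed. Training is not shown, so the vectors stay random.

```python
import random

random.seed(0)  # fixed seed so the random vectors are reproducible

# Tokenization: split the text and give each distinct token an id.
text = "the cat sat on the mat"
tokens = text.split()  # assumption: naive whitespace tokenizer
token_to_id = {}
for token in tokens:
    token_to_id.setdefault(token, len(token_to_id))
token_ids = [token_to_id[token] for token in tokens]

# Vectorization: one randomly initialized n-dimensional vector per id.
embedding_dim = 4  # assumption: tiny dimension for readability
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(token_to_id))
]

# Lookup: an id selects a row of the table. Training (not shown) would
# adjust these rows so that related tokens end up close together.
vectors = [embedding_table[i] for i in token_ids]

print(token_ids)                     # [0, 1, 2, 3, 0, 4]
print(len(vectors), len(vectors[0])) # 6 4
```

Both occurrences of "the" share id 0 and therefore the same vector, which is why the table needs only five rows for six tokens.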

## Tokenization

In [10]:
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")


In [11]:
text = 'I love cryptography, mathematics, and cybersecurity.'
tokens = tokenizer.tokenize(text)
print(tokens)

['i', 'love', 'crypt', '##ography', ',', 'mathematics', ',', 'and', 'cyber', '##se', '##cu', '##rity', '.']


In [12]:
# Convert the tokens to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[1045, 2293, 19888, 9888, 1010, 5597, 1010, 1998, 16941, 3366, 10841, 15780, 1012]


In [13]:
#We can go backwards from ids to tokens
print(tokenizer.convert_ids_to_tokens(token_ids))

['i', 'love', 'crypt', '##ography', ',', 'mathematics', ',', 'and', 'cyber', '##se', '##cu', '##rity', '.']


In [14]:
#There is a built in encode function that also does this.
token_ids = tokenizer.encode(text)
print(token_ids)
# encode() also adds BERT's special tokens: id 101 is [CLS], which marks the beginning of the sequence, and id 102 is [SEP], which marks the end of a segment (here, the end of the sequence).
tokens_from_encode = tokenizer.convert_ids_to_tokens(token_ids)
print(tokens_from_encode)

[101, 1045, 2293, 19888, 9888, 1010, 5597, 1010, 1998, 16941, 3366, 10841, 15780, 1012, 102]
['[CLS]', 'i', 'love', 'crypt', '##ography', ',', 'mathematics', ',', 'and', 'cyber', '##se', '##cu', '##rity', '.', '[SEP]']
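
Comparing the two id lists shows what `encode` adds on top of `convert_tokens_to_ids` for a single sentence. The sketch below reproduces that difference by hand, using the ids printed earlier and the `[CLS]`/`[SEP]` ids of the `bert-base-uncased` vocabulary. The constant names are illustrative.

```python
# Ids copied from the convert_tokens_to_ids output above.
plain_ids = [1045, 2293, 19888, 9888, 1010, 5597, 1010, 1998,
             16941, 3366, 10841, 15780, 1012]

CLS_ID = 101  # [CLS] in the bert-base-uncased vocabulary
SEP_ID = 102  # [SEP] in the bert-base-uncased vocabulary

# Wrap the sequence in the special tokens, as encode() does by default.
with_special_tokens = [CLS_ID] + plain_ids + [SEP_ID]

print(with_special_tokens)
```

The printed list matches the output of `tokenizer.encode(text)` shown above.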


## Vectorization

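To obtain an embedding for a token, you need a model that holds an embedding table. The pre-trained bert-base-uncased model loaded in the Tokenization section was downloaded together with its weights, including its token embeddings, so we can look up vectors from it directly.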

In [15]:
import torch

In [16]:
# get the embedding vector for the word "cyber" or whatever example you want to use!
example_word = "cyber"
example_token_id = tokenizer.convert_tokens_to_ids([example_word])[0]
example_embedding = model.embeddings.word_embeddings(torch.tensor([example_token_id]))

print(example_embedding.shape)
print(example_embedding)
# torch.Size([1, 768])
# The returned word vector has 768 dimensions, which matches the hidden size of bert-base-uncased.

torch.Size([1, 768])
tensor([[-3.9401e-02, -3.4605e-02, -3.0109e-02, -5.4147e-02,  6.7955e-02,
          8.1688e-02, -4.9087e-02, -8.6033e-03, -2.2622e-02, -3.3102e-02,
          3.3617e-02, -3.2629e-02, -5.9386e-02, -5.0595e-02, -5.0741e-02,
          9.8050e-03, -9.8762e-02, -3.0505e-02,  1.6212e-02, -9.1320e-02,
         -1.9809e-02, -7.5010e-02, -8.5134e-03,  2.1966e-02, -7.5120e-02,
         -7.9889e-02,  2.4941e-03, -5.2005e-02, -7.0397e-02, -1.2168e-03,
          7.5779e-02,  1.8078e-02, -1.6702e-04, -7.9205e-03,  1.4067e-02,
         -3.7053e-02, -5.6451e-02, -7.7005e-02, -9.4111e-02, -7.0676e-03,
          8.8820e-03,  1.0885e-02, -3.4853e-02, -4.3510e-02,  1.8766e-02,
         -2.2594e-02, -5.9443e-02, -4.7548e-02, -3.8350e-02,  5.0268e-03,
         -8.0528e-03, -8.3088e-03, -4.2568e-03,  6.0233e-02, -3.2517e-02,
         -2.4433e-02,  5.7999e-02, -2.1103e-02, -9.5786e-03, -1.3551e-02,
         -7.1807e-03,  4.9844e-02,  5.2372e-02, -6.0588e-02, -4.2348e-02,
          2.8226e

## Cosine similarity
Cosine similarity is a way to measure how similar two vectors are. It’s often used in natural language processing to compare the content of two texts.

To calculate the cosine similarity, we look at the angle between two vectors. If the vectors point in the same direction, they are more similar, and if they point in opposite directions, they are less similar. The result is a number between -1 and 1, where 1 means the vectors point in exactly the same direction, 0 means they are perpendicular, and -1 means they point in opposite directions. Only direction matters; the lengths of the vectors do not.
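
As a minimal sketch of the calculation, the cosine of the angle is the dot product of the two vectors divided by the product of their lengths. The helper name `cosine_similarity` and the 2-dimensional example vectors are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```

Up to floating-point rounding, the three values are 1, 0, and -1. Note that `[1.0, 2.0]` and `[2.0, 4.0]` score 1 even though they have different lengths.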

## Euclidean Distance
Euclidean distance is another way to measure how similar two vectors are. To calculate the Euclidean distance, d(u, v), we compute the length of the difference vector, ||u - v||. The result is a non-negative number: 0 means the vectors are identical, and the larger the number, the farther apart they are.
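
A minimal sketch of the same calculation in plain Python follows. The helper name `euclidean_distance` and the example vectors are illustrative.

```python
import math

def euclidean_distance(u, v):
    """Length of the difference vector u - v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_triangle = euclidean_distance([0.0, 0.0], [3.0, 4.0])   # 3-4-5 right triangle
d_identical = euclidean_distance([1.0, 2.0], [1.0, 2.0])  # identical vectors
d_scaled = euclidean_distance([1.0, 2.0], [2.0, 4.0])     # same direction, different length

print(d_triangle)   # 5.0
print(d_identical)  # 0.0
print(d_scaled > 0) # True
```

Unlike cosine similarity, Euclidean distance is sensitive to vector length: `[1.0, 2.0]` and `[2.0, 4.0]` point in the same direction but are still a nonzero distance apart.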

In [17]:
import torch

In [18]:
word_one = "king"
word_two = "queen"
one_token_id = tokenizer.convert_tokens_to_ids([word_one])[0]
one_embedding = model.embeddings.word_embeddings(torch.tensor([one_token_id]))

two_token_id = tokenizer.convert_tokens_to_ids([word_two])[0]
two_embedding = model.embeddings.word_embeddings(torch.tensor([two_token_id]))

cos = torch.nn.CosineSimilarity(dim=1)
cosine_similarity = cos(one_embedding, two_embedding)
euclidean_distance = torch.cdist(one_embedding,two_embedding)
print(f'Cosine Similarity between \'{word_one}\' and \'{word_two}\': {cosine_similarity[0]}')
# 0.646
print(f'Euclidean Distance between \'{word_one}\' and \'{word_two}\': {euclidean_distance[0][0]}')
# 0.886

Cosine Similarity between 'king' and 'queen': 0.6468513011932373
Euclidean Distance between 'king' and 'queen': 0.8862659335136414


In [19]:
word_three = "man"
word_four = "woman"
three_token_id = tokenizer.convert_tokens_to_ids([word_three])[0]
three_embedding = model.embeddings.word_embeddings(torch.tensor([three_token_id]))

four_token_id = tokenizer.convert_tokens_to_ids([word_four])[0]
four_embedding = model.embeddings.word_embeddings(torch.tensor([four_token_id]))

cos = torch.nn.CosineSimilarity(dim=1)
cosine_similarity = cos(three_embedding, four_embedding)
euclidean_distance = torch.cdist(three_embedding,four_embedding)
print(f'Cosine Similarity between \'{word_three}\' and \'{word_four}\': {cosine_similarity[0]}')
# 0.633
print(f'Euclidean Distance between \'{word_three}\' and \'{word_four}\': {euclidean_distance[0][0]}')
# 0.847

Cosine Similarity between 'man' and 'woman': 0.6337041854858398
Euclidean Distance between 'man' and 'woman': 0.8479673266410828
