# What's happening behind the scenes of tokenization?

Coded and shared by Divya Patel, Microsoft.

In [1]:
import torch

In [2]:
# Step 1: Load the model and tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")

In [3]:
# Step 2: Preprocess the input
inputs = tokenizer("We liked the embedders, we were okay with encoder decoders, but we love the transformers.", return_tensors="pt")

In [4]:
inputs

{'input_ids': tensor([[  101,  2057,  4669,  1996,  7861,  8270, 13375,  1010,  2057,  2020,
          3100,  2007,  4372, 16044,  2099, 21933, 13375,  1010,  2021,  2057,
          2293,  1996, 19081,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1]])}

## How did we get from the input string to these numbers?

## Tokenization

- Our inputs are text, but these models work with numbers, so the first step is to convert the text into numbers.

- The tokenizer first tokenizes the input: it splits the input string into words, parts of words, and punctuation symbols, which are called tokens.

- It then converts these tokens into numbers. Each token is associated with an input ID, which is an integer. The same token always maps to the same ID.
- It also generates other inputs that the model needs,
    - like the attention mask (a vector with the same length as the input IDs that has a 0 where the corresponding token is a padding token and a 1 otherwise)
    - and, for models that use them, the token type IDs (which distinguish different parts of the input when it has several, as in question answering). The DistilBERT tokenizer used here does not return them, which is why the output above contains only `input_ids` and `attention_mask`.
- The tokenizer returns a dictionary containing all the inputs that the model expects.
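
The steps in the list above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the library's implementation: `toy_vocab` and `toy_encode` are hypothetical names, the splitting rule is deliberately naive, and the IDs are copied from the outputs shown in this notebook.

```python
# Toy vocabulary: a handful of IDs copied from the notebook outputs.
toy_vocab = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "we": 2057, "love": 2293, "the": 1996, "transformers": 19081, ".": 1012,
}

def toy_encode(text):
    """Hypothetical helper: lowercase, split, map tokens to IDs, add special tokens."""
    # Naive tokenization: separate the final period, then split on whitespace.
    tokens = text.lower().replace(".", " .").split()
    ids = [toy_vocab["[CLS]"]] + [toy_vocab[t] for t in tokens] + [toy_vocab["[SEP]"]]
    # No padding for a single sequence, so every position is a real token.
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

encoded = toy_encode("We love the transformers.")
print(encoded)
```

The result is a dictionary with the same two keys as the real tokenizer output above; the cells that follow reproduce each step with the actual tokenizer.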

In [5]:
tokenized_str = tokenizer.tokenize("We liked the embedders, we were okay with encoder decoders, but we love the transformers.")
print(tokenized_str)

['we', 'liked', 'the', 'em', '##bed', '##ders', ',', 'we', 'were', 'okay', 'with', 'en', '##code', '##r', 'deco', '##ders', ',', 'but', 'we', 'love', 'the', 'transformers', '.']


In [6]:
tokenized_inp_id = tokenizer.convert_tokens_to_ids(tokenized_str)
print(tokenized_inp_id)

[2057, 4669, 1996, 7861, 8270, 13375, 1010, 2057, 2020, 3100, 2007, 4372, 16044, 2099, 21933, 13375, 1010, 2021, 2057, 2293, 1996, 19081, 1012]


In [7]:
tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token

('[CLS]', '[SEP]', '[PAD]')

In [8]:
tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id

(101, 102, 0)

In [9]:
tokenized_inp_id = [tokenizer.cls_token_id] + tokenized_inp_id + [tokenizer.sep_token_id]
print(tokenized_inp_id)

[101, 2057, 4669, 1996, 7861, 8270, 13375, 1010, 2057, 2020, 3100, 2007, 4372, 16044, 2099, 21933, 13375, 1010, 2021, 2057, 2293, 1996, 19081, 1012, 102]


In [10]:
inputs['input_ids'] == torch.tensor(tokenized_inp_id).unsqueeze(0)

tensor([[True, True, True, True, True, True, True, True, True, True, True, True,
         True, True, True, True, True, True, True, True, True, True, True, True,
         True]])

In [11]:
tokenizer.decode(tokenized_inp_id)

'[CLS] we liked the embedders, we were okay with encoder decoders, but we love the transformers. [SEP]'
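
Decoding reverses the lookup: IDs become tokens again, and pieces that start with `##` are glued onto the piece before them. A rough sketch of that merging step follows, assuming a hypothetical helper `merge_wordpieces`. The real `decode` also tidies the spacing around punctuation, which this sketch does not attempt.

```python
def merge_wordpieces(tokens):
    """Hypothetical helper: join '##' continuation pieces onto the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: drop the '##' marker and append to the last word.
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

pieces = ["we", "liked", "the", "em", "##bed", "##ders"]
merged = merge_wordpieces(pieces)
print(" ".join(merged))  # we liked the embedders
```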

In [12]:
inputs['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1]])

In [13]:
inputs_batch2 = tokenizer(["this summer is killing me", "me too"], padding=True, return_tensors="pt")

In [14]:
inputs_batch2['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0, 0]])
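
With `padding=True`, the shorter sequence is padded to the length of the longest one in the batch, and the attention mask marks the padded positions with 0. A minimal pure-Python sketch of that behaviour is shown here. `pad_batch` is a hypothetical helper, and the word IDs are made-up placeholders; only 101, 102 and 0 (`[CLS]`, `[SEP]`, `[PAD]`) come from the notebook.

```python
PAD_ID = 0  # tokenizer.pad_token_id, as printed earlier in the notebook

def pad_batch(sequences, pad_id=PAD_ID):
    """Hypothetical helper: right-pad ID lists and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Placeholder IDs (11..16) stand in for the real word IDs.
batch = pad_batch([[101, 11, 12, 13, 14, 15, 102], [101, 15, 16, 102]])
print(batch["attention_mask"])
```

The two sequences have 7 and 4 tokens including `[CLS]` and `[SEP]`, so the second row ends in three zeros, matching the mask printed above.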

In [15]:
tokenized_str = tokenizer.tokenize("We liked the embedders, we were okay with encoder decoders, but we love the transformers.")
print(tokenized_str)

['we', 'liked', 'the', 'em', '##bed', '##ders', ',', 'we', 'were', 'okay', 'with', 'en', '##code', '##r', 'deco', '##ders', ',', 'but', 'we', 'love', 'the', 'transformers', '.']


## How did it decide to split some words and not others?

- The tokenizer's vocabulary is built ahead of time from the text of a training corpus, and the model can only work with tokens from that vocabulary.

- If every encountered word were treated as a separate token, the vocabulary would become very large, which in turn would increase the size of the model.
- To avoid this, the tokenizer uses subword tokenization, which means that it splits some words into smaller parts.
- The tokenizer is trained so that the vocabulary stays at a manageable size while most words can still be represented as a sequence of known pieces.
- This is why the word "embedders", which is not in the vocabulary, was split into "em", "##bed", "##ders", while "transformers" stayed whole. The `##` prefix marks a piece that continues the previous one.

## How to check whether a word is present in the vocabulary as a whole?
- The tokenizer has a method called `get_vocab()` that returns its vocabulary as a dictionary mapping tokens to IDs. You can use it to check whether a word is in the vocabulary.

- For instance, `tokenizer.get_vocab().get("embedders")` returns `None`, while `tokenizer.get_vocab().get("transformers")` returns a number.

DistilBERT uses the same WordPiece tokenization as BERT, which means that a word that is not in the vocabulary is split into subwords that are.
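
WordPiece splitting is commonly described as greedy longest-match-first: starting from the left, take the longest vocabulary entry that matches, then continue with the remainder, prefixing continuation pieces with `##`. The sketch below illustrates that idea with a tiny hand-made vocabulary. The function `wordpiece` and `toy_vocab` are hypothetical, and this is not the library's code.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word (illustrative sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hand-picked entries, chosen to mirror the pieces printed in this notebook.
toy_vocab = {"em", "##bed", "##ders", "en", "##code", "##r", "deco", "transformers"}
for w in ["embedders", "encoder", "decoders", "transformers"]:
    print(w, wordpiece(w, toy_vocab))
```

Because the longest match wins, "embedders" starting with "em" tells us that no longer prefix of the word is in the vocabulary.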

In [16]:
print(tokenizer.get_vocab().get("embedders"))

print(tokenizer.get_vocab().get("transformers"))

None
19081


In [17]:
vocab = tokenizer.get_vocab()
len(vocab)

30522

In [18]:
vocab

{'established': 2511,
 'membranes': 24972,
 'gerry': 14926,
 'beings': 9552,
 'clapped': 18310,
 'tall': 4206,
 'operate': 5452,
 'font': 15489,
 '##roids': 29514,
 'resolving': 29304,
 'jonathan': 5655,
 '[unused873]': 878,
 '##unk': 16814,
 '##wine': 21924,
 'midwest': 13608,
 '##in': 2378,
 'dragons': 8626,
 'candles': 14006,
 '##ake': 13808,
 'aug': 15476,
 'nepal': 8222,
 '##ef': 12879,
 'lifeless': 22185,
 'dail': 26181,
 'lodges': 26767,
 '##・': 30264,
 'kidney': 14234,
 'marcia': 22548,
 '##mable': 24088,
 '##cum': 24894,
 'squealed': 26175,
 'ð': 1098,
 'rocked': 14215,
 'rams': 13456,
 '##laid': 24393,
 'contend': 27481,
 'composer': 4543,
 'migration': 9230,
 'relations': 4262,
 'injuring': 22736,
 'champaign': 28843,
 'hearst': 25419,
 'felix': 8383,
 'protagonists': 21989,
 'vaults': 28658,
 'array': 9140,
 'valley': 3028,
 'weapon': 5195,
 'charging': 13003,
 'craftsman': 26286,
 'pop': 3769,
 '##pie': 14756,
 '##zuka': 22968,
 'awkwardly': 18822,
 'mistakes': 12051,
 ...

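### How can we add new words to the tokenizer?

If you want to add a new word to the vocabulary, you can use the `add_tokens()` method, which returns the number of tokens that were actually added. Note that the model's embedding matrix does not grow automatically: before feeding the new ID to the model, resize it with `model.resize_token_embeddings(len(tokenizer))`. The embedding for the new token starts out untrained.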

In [19]:
# Add new word into tokenizer
tokenizer.add_tokens(["embedders"])

1

In [20]:
tokenizer.get_vocab().get("embedders")

30522

In [21]:
vocab = tokenizer.get_vocab()
len(vocab)

30523

In [22]:
tokenized_str = tokenizer.tokenize("We liked the embedders, we were okay with encoder decoders, but we love the transformers.")
print(tokenized_str)

['we', 'liked', 'the', 'embedders', ',', 'we', 'were', 'okay', 'with', 'en', '##code', '##r', 'deco', '##ders', ',', 'but', 'we', 'love', 'the', 'transformers', '.']
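
The new ID, 30522, is simply the next free index after the original 30522 entries (IDs 0 to 30521). A toy sketch of that bookkeeping is given here. `toy_add_tokens` is a hypothetical stand-in that assumes IDs are assigned in insertion order, and it is not the library's implementation.

```python
def toy_add_tokens(vocab, new_tokens):
    """Hypothetical helper: give each unseen token the next free ID; return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free index
            added += 1
    return added

toy_vocab = {"[PAD]": 0, "we": 1, "love": 2}
n_added = toy_add_tokens(toy_vocab, ["embedders", "we"])
print(n_added, toy_vocab["embedders"])  # 1 3
```

In this sketch "we" is already present, so only one token is added, mirroring the `1` returned by `add_tokens` above.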
