Source: https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt

# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# !pip install datasets evaluate transformers[sentencepiece]

<span style="color:blue"><b>Tokenizers</b> are one of the core components of the NLP pipeline. They serve one purpose: to <b>translate text into data that can be processed by the model</b>. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we’ll explore exactly what happens in the tokenization pipeline.</span>

In NLP tasks, the data that is generally processed is raw text. Here’s an example of such text:

```
Jim Henson was a puppeteer
```

However, models can only process numbers, so we need to find a way to convert the raw text to numbers. That’s what the tokenizers do, and there are a lot of ways to go about this. <span style="color:blue">The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation.</span>

<img src="images/tokenizer_1.png" style="width:1000px;" title="Tokenizer">

Let’s take a look at <b>some examples of tokenization algorithms</b>, and try to answer some of the questions you may have about tokenization.

- <b>Word-based</b>
- <b>Character-based</b>
- <b>Sub-word based</b>

## Word-based

The first type of tokenizer that comes to mind is <i>word-based</i>. It’s generally very easy to set up and use with only a few rules, and it often yields decent results. For example, in the image below, the goal is to <b>split the raw text into words</b> and find a numerical representation for each of them:

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/word_based_tokenization.svg" style="width:800px;" title="Word based tokenization">
<hr>
<img src="images/word-based-tokenization-1.png" style="width:800px;" title="Word based tokenization">


There are different ways to split the text. For example, we could use whitespace to tokenize the text into words by applying Python’s `split()` function:

In [4]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large “vocabularies,” where a <span style="color:blue"><b>vocabulary is defined by the total number of independent tokens that we have in our corpus</b></span>.

Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary minus one. The model uses these IDs to identify each word.
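As a minimal sketch of this ID assignment, the mapping can be built in a few lines of plain Python. The toy corpus and the "first appearance" ordering are assumptions made for illustration; real tokenizers ship with a fixed vocabulary file instead.

```python
# Toy corpus (an assumption for illustration); a real vocabulary
# would be built from a far larger collection of text.
corpus = ["Jim Henson was a puppeteer", "Jim was a good puppeteer"]

# Give each distinct word the next free ID, in order of first appearance.
word_to_id = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)

print(word_to_id)
# IDs for the first sentence
print([word_to_id[word] for word in corpus[0].split()])
```

With six distinct words, the IDs run from 0 to 5.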

<span style="color:red">If we want to completely cover a language with a word-based tokenizer, we’ll need to have an identifier for each word in the language, which will generate a <b>huge amount of tokens (large vocabularies result in heavy models)</b>. For example, there are over 500,000 words in the English language, so to build a map from each word to an input ID we’d need to keep track of that many IDs. Furthermore, words like “dog” are represented differently from words like “dogs”, and the model will initially have no way of knowing that “dog” and “dogs” are similar: it will <b>identify the two words as unrelated</b>. The same applies to other similar words, like “run” and “running”, which the model will not see as being similar initially.</span>

<img src="images/word-based-tokenization-issues-1.png" style="width:500px;" title="Word based tokenization issues">
<img src="images/word-based-tokenization-issues-2.png" style="width:400px;" title="Word based tokenization issues">

<span style="color:blue">Finally, we need a <b>custom token</b> to represent words that are not in our vocabulary. This is known as the <b>“unknown” token</b>, often represented as <b>“[UNK]”</b> or <b>“&lt;unk&gt;”</b>. </span><span style="color:red">It’s generally a bad sign if you see that the tokenizer is producing a lot of these tokens, as it wasn’t able to retrieve a sensible representation of a word and you’re losing information along the way. <span style="color:green"><b>The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token.</b></span></span>

<span style="color:green">One naive way to overcome the issue and build smaller vocabularies is to limit the number of words we add to the vocabulary. For example, pick the 10,000 most frequently occurring words in the corpus and build the vocabulary from them.</span> Any word outside the vocabulary is then marked as `"UNKNOWN"`.

<img src="images/word-based-tokenizer-limit-tokens-issue.png" style="width:800px;" title="Word based tokenization issues">
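The frequency cut-off described above can be sketched as follows. This is only an illustration: the corpus, the cap of 3 words (`max_vocab_size`), and the helper name `limit_to_vocab` are all hypothetical.

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the log"
max_vocab_size = 3  # hypothetical cap; the text above uses 10,000 as its example
UNK = "UNKNOWN"

counts = Counter(corpus.split())
# Keep only the most frequent words; every other word maps to UNK.
vocab = {word for word, _ in counts.most_common(max_vocab_size)}

def limit_to_vocab(text):
    """Replace every out-of-vocabulary word with the unknown marker."""
    return [word if word in vocab else UNK for word in text.split()]

print(limit_to_vocab("the dog sat on the sofa"))
```

Here "dog" is lost even though it appeared in the corpus, which is exactly the information loss that the unknown token causes.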
    


<span style="color:green">One way to reduce the amount of unknown tokens is to go one level deeper, using a <i>character-based</i> tokenizer.</span>

## Character-based

Character-based tokenizers <b>split the text into characters</b>, rather than words. <span style="color:green">This has two primary benefits</span>:

- <span style="color:green">The vocabulary is much smaller.</span>
- <span style="color:blue">There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.</span>
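A small sketch of the second point, assuming a one-sentence toy corpus: a word that was never seen as a whole can still be spelled out from known characters. Note that on such a tiny corpus the character vocabulary is not actually smaller than the word vocabulary; the size benefit shows up on realistic corpora, where the set of characters stays bounded while the set of words keeps growing.

```python
corpus = "Jim Henson was a puppeteer"

word_vocab = set(corpus.split())
char_vocab = set(corpus)  # every distinct character, including the space

new_word = "puppets"  # never appears as a whole word in the corpus

# Word-level lookup fails: the word would become an unknown token.
print(new_word in word_vocab)
# Character-level lookup succeeds: every character is already known.
print(all(ch in char_vocab for ch in new_word))
```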

<span style="color:red">But here too some questions arise concerning spaces and punctuation:</span>

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/character_based_tokenization.svg" style="width:800px;" title="Character based tokenization">

<img src="images/character-based-tokenizationa-1.png" style="width:700px;" title="Character based tokenization">
<img src="images/character-based-tokenizationa-2.png" style="width:700px;" title="Character based tokenization">

<span style="color:red"><b>This approach isn’t perfect either</b>. Since the representation is now based on characters rather than words, one could argue that, intuitively, <b>it’s less meaningful</b>: each character doesn’t mean a lot on its own, whereas a word does. </span> <span style="color:green">However, this again differs according to the language; in Chinese, for example, each character carries more information than a character in a Latin-script language.</span>

<span style="color:red">Another thing to consider is that we’ll end up with a <b>very large amount of tokens to be processed by our model</b>: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters. This can have an impact on the size of the context the model will carry around, and will reduce the size of the text we can use as input for our model.</span>
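To make the length difference concrete, here is a quick comparison on the example sentence. Keeping spaces as tokens is an assumption of this sketch; a real character-based tokenizer may treat them differently.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text)  # spaces are kept as tokens in this sketch

print(len(word_tokens), len(char_tokens))
```

The same sentence takes 5 tokens at the word level and 26 at the character level, so a fixed-size context window holds roughly five times less text here.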

<span style="color:green">To get the best of both worlds, we can use a third technique that combines the two approaches: <b><i>subword tokenization</i>.</b></span>

## Subword tokenization

<img src="images/subword-based-tokenization-1.png" style="width:700px;" title="Sub-word based tokenization">

<span style="color:blue">Subword tokenization algorithms rely on the principle that <b>frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords</b>.</span>

<img src="images/subword-based-tokenization-2.png" style="width:700px;" title="Sub-word based tokenization">

<b>Subwords help identify similar syntactic or semantic situations in text</b>. For example:
- The model can learn that the following words have related meanings and are linked:
    - Token
    - Tokens
    - Tokenizing
    - Tokenization
    - Tokenizer
- The model can also learn that the following words share a suffix and are probably used in the same syntactic situations:
    - Tokenization
    - Modernization
    - Immunization
    - Urbanization

<img src="images/subword-based-tokenization-3.png" style="width:550px;" title="Sub-word based tokenization">

For instance, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”. These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.

Here is an example showing how a subword tokenization algorithm would tokenize the sequence “Let’s do tokenization!“:

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/bpe_subword.svg" style="width:800px;" title="Subword based tokenization">

<span style="color:green">These subwords end up providing a <b>lot of semantic meaning</b>: for instance, in the example above “tokenization” was split into “token” and “ization”, two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have <b>relatively good coverage with small vocabularies, and close to no unknown tokens</b>.</span>

This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.

Subword-based tokenizers generally have a way to mark which tokens start a word and which tokens continue one. For example:
- <i>token</i> - Start of a word
- <i>##ization</i> - Continuation of a word
    - <i>##</i> - This prefix indicates that "##ization" continues a word rather than starting one. The "##" prefix comes from the BERT tokenizer; other tokenizers use other markers for the same purpose.
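The splitting and the "##" marking can be sketched with a greedy longest-match-first loop, which is the general strategy WordPiece-style tokenizers apply to each word. This is a simplified illustration, not the library's implementation: the function name `wordpiece_like` and the toy vocabulary are assumptions.

```python
def wordpiece_like(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word continuation
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"token", "##s", "##ization", "##izer", "modern", "annoying", "##ly"}

for word in ["token", "tokens", "tokenization", "modernization", "annoyingly", "puppet"]:
    print(word, wordpiece_like(word, toy_vocab))
```

Frequent words such as "token" stay whole, rarer words are assembled from shared pieces such as "##ization", and only a word with no matching pieces falls back to the unknown token.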

<img src="images/subword-based-tokenization-4.png" style="width:350px;" title="Sub-word based tokenization">

### And more!
Unsurprisingly, there are many more techniques out there. To name a few:

- Byte-level BPE, as used in GPT-2
- WordPiece, as used in BERT
- SentencePiece or Unigram, as used in several multilingual models

<img src="images/subword-based-tokenization-5.png" style="width:650px;" title="Sub-word based tokenization">

You should now have sufficient knowledge of how tokenizers work to get started with the API.

## Loading and saving

Loading and saving tokenizers is as simple as it is with models. Actually, it’s based on the same two methods: `from_pretrained()` and `save_pretrained()`. <span style="color:green">These methods will load or save the algorithm used by the tokenizer (a bit like the <i>architecture</i> of the model) as well as its vocabulary (a bit like the <i>weights</i> of the model).</span>

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the `BertTokenizer` class:

In [5]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to `AutoModel`, the `AutoTokenizer` class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

We can now use the tokenizer as shown in the previous section:

In [3]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Saving a tokenizer is identical to saving a model:

In [7]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

In [12]:
ls -a -l -h directory_on_my_computer/

total 853856
drwxr-xr-x   8 prasanth.thangavel  staff   256B Jul  4 20:47 ./
drwxr-xr-x  11 prasanth.thangavel  staff   352B Jul  4 20:48 ../
-rw-r--r--   1 prasanth.thangavel  staff   656B Jul  4 16:07 config.json
-rw-r--r--   1 prasanth.thangavel  staff   413M Jul  4 16:07 pytorch_model.bin
-rw-r--r--   1 prasanth.thangavel  staff   125B Jul  4 20:47 special_tokens_map.json
-rw-r--r--   1 prasanth.thangavel  staff   653K Jul  4 20:47 tokenizer.json
-rw-r--r--   1 prasanth.thangavel  staff   315B Jul  4 20:47 tokenizer_config.json
-rw-r--r--   1 prasanth.thangavel  staff   208K Jul  4 20:47 vocab.txt
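The `vocab.txt` file in the listing is what carries the vocabulary. As a rough sketch of the idea (not the library's actual saving code), a BERT-style vocabulary file stores one token per line, and a token's ID is its zero-based line number. The toy vocabulary and the temporary directory below are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical toy vocabulary; list position is the token ID.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "token", "##ization"]

with tempfile.TemporaryDirectory() as save_dir:
    path = os.path.join(save_dir, "vocab.txt")
    # Save: one token per line, so the line number is the token's ID.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_vocab) + "\n")
    # Load: rebuild the token -> ID mapping from the line positions.
    with open(path, encoding="utf-8") as f:
        loaded = {line.rstrip("\n"): i for i, line in enumerate(f)}

print(loaded["token"])
```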


We’ll talk more about `token_type_ids` in Chapter 3, and we’ll explain the `attention_mask` key a little later. First, let’s see how the `input_ids` are generated. To do this, we’ll need to look at the intermediate methods of the tokenizer.

## Encoding

<span style="color:blue">Translating text to numbers is known as <i>encoding</i>. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.</span>

As we’ve seen, <b>the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called <i>tokens</i></b>. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to <b>make sure we use the same rules that were used when the model was pretrained</b>.

<b>The second step is to convert those tokens into numbers</b>, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a <i>vocabulary</i>, which is the part we download when we instantiate it with the `from_pretrained()` method. Again, we need to <b>use the same vocabulary used when the model was pretrained</b>.

To get a better understanding of the two steps, we’ll explore them separately. Note that we will use some methods that perform parts of the tokenization pipeline separately to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs (as shown in section 2).

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Let's try to tokenize!")  # Raw text
print(inputs["input_ids"])  # Input IDs: a list of numbers

[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]


<img src="images/encoding-1.png" style="width:550px;" title="Encoding">

<b>In summary, converting raw text into input IDs for the model involves 3 steps: the two encoding steps described above, plus a final step that adds special tokens:</b>
1. Tokenization using `tokenize`
2. From tokens to input IDs using `convert_tokens_to_ids`
3. Add special tokens to input IDs using `prepare_for_model`
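The three steps can be mimicked end to end in plain Python. This is a toy sketch, not the Transformers implementation: the function names only mirror the roles of the real methods, and the vocabulary, the lowercase-and-split rule, and the `[CLS]`/`[SEP]` wrapping are assumptions modeled on a BERT-style tokenizer.

```python
# Hypothetical toy vocabulary mapping tokens to IDs.
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
             "using": 3, "a": 4, "network": 5, "is": 6, "simple": 7}

def tokenize(text):
    # Step 1: split raw text into tokens (toy rule: lowercase + whitespace).
    return text.lower().split()

def convert_tokens_to_ids(tokens):
    # Step 2: look each token up in the vocabulary, falling back to [UNK].
    return [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]

def add_special_tokens(ids):
    # Step 3: wrap the sequence in the markers a BERT-style model expects.
    return [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]

tokens = tokenize("Using a network is simple")
ids = convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
print(add_special_tokens(ids))
```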

### Tokenization

The tokenization process is done by the `tokenize()` method of the tokenizer. The output of this method is a list of strings, or tokens.

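Different tokenizers use different markers for word pieces. In the outputs below, ALBERT prefixes tokens that start a word with `▁`, while BERT prefixes tokens that continue a word with `##`.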

In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['▁using', '▁a', '▁transform', 'er', '▁network', '▁is', '▁simple']


In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['using', 'a', 'transform', '##er', 'network', 'is', 'simple']


In [24]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


<span style="color:green">This `bert-base-cased` tokenizer is a <b>subword tokenizer</b>: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with `Transformer`, which is split into two tokens: `Trans` and `##former`.</span>

### From tokens to input IDs

The conversion of tokens to respective input IDs is handled by the `convert_tokens_to_ids()` tokenizer method:

In [26]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


### Add special tokens

Lastly, the tokenizer adds special tokens the model expects.

In [28]:
final_inputs = tokenizer.prepare_for_model(ids)
print(final_inputs["input_ids"])

[101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]


In [34]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer(sequence)["input_ids"])  # Same result as above

[101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]


<span style="color:green">These outputs (input IDs), once converted to the appropriate framework tensor, can then be used as inputs to a model</span> as seen earlier in this chapter.

## Decoding

<span style="color:blue"><i>Decoding</i> is going the other way around: <b>from vocabulary indices, we want to get a string</b></span>. This can be done with the `decode()` method as follows:

In [35]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple


In [36]:
decoded_string = tokenizer.decode([101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102])
print(decoded_string)

[CLS] Using a Transformer network is simple [SEP]


<span style="color:green">Note that the `decode()` method not only converts the indices back to tokens, but <b>also groups together the tokens that were part of the same words to produce a readable sentence</b>. This behavior will be extremely useful when we use models that predict new text (either <b>text generated from a prompt</b>, or for <b>sequence-to-sequence problems</b> like <b>translation</b> or <b>summarization</b>).</span>
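The grouping behavior can be illustrated with a toy decoder that glues "##" continuation tokens back onto the preceding word. This is a simplified sketch rather than what the library does internally: the ID-to-token table, the function name `toy_decode`, and its `skip_special` flag are hypothetical.

```python
# Hypothetical ID -> token table for a BERT-style tokenizer.
id_to_token = {0: "[CLS]", 1: "[SEP]", 2: "Using", 3: "a",
               4: "Trans", 5: "##former", 6: "network"}

def toy_decode(ids, skip_special=True):
    """Map IDs back to tokens and merge word continuations."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if skip_special and token in ("[CLS]", "[SEP]"):
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the continuation onto the previous word
        else:
            words.append(token)
    return " ".join(words)

print(toy_decode([0, 2, 3, 4, 5, 6, 1]))
print(toy_decode([0, 2, 3, 4, 5, 6, 1], skip_special=False))
```

The real `decode()` method accepts a `skip_special_tokens` argument that serves a similar purpose to the `skip_special` flag used here.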

By now you should understand the atomic operations a tokenizer can handle: tokenization, conversion to IDs, and converting IDs back to a string. However, we’ve only scratched the surface. In the following section, we’ll take our approach to its limits and take a look at how to overcome them.

Let's look at how the special tokens differ between tokenizers.

In [37]:
print(sequence)

Using a Transformer network is simple


In [38]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
inputs = tokenizer(sequence)

print(tokenizer.decode(inputs['input_ids']))

[CLS] Using a Transformer network is simple [SEP]

In [40]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
inputs = tokenizer(sequence)

print(tokenizer.decode(inputs['input_ids']))

<s>Using a Transformer network is simple</s>
