# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also, log in to Hugging Face:

In [2]:
from huggingface_hub import notebook_login

notebook_login()


<font color='blue'>Tokenizers</font> are one of the <font color='blue'>core components</font> of the NLP pipeline. They serve one purpose: to <font color='blue'>translate text</font> into <font color='blue'>data</font> that can be <font color='blue'>processed by the model</font>. Models can only process numbers, so tokenizers need to <font color='blue'>convert</font> our <font color='blue'>text inputs</font> to <font color='blue'>numerical data</font>. In this section, we'll explore exactly what happens in the <font color='blue'>tokenization pipeline</font>.

In NLP tasks, the data that is generally processed is raw text. Here's an example of such text:

In [3]:
raw_text = "Jim Henson was a puppeteer"
print(raw_text)

Jim Henson was a puppeteer


However, <font color='blue'>models</font> can only <font color='blue'>process numbers</font>, so we need to find a way to <font color='blue'>convert</font> the <font color='blue'>raw text</font> to <font color='blue'>numbers</font>. That's what the tokenizers do, and there are a lot of ways to go about this. The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation.

Let's take a look at some <font color='blue'>examples</font> of <font color='blue'>tokenization algorithms</font>, and try to answer some of the questions you may have about tokenization.

### Word-based

The <font color='blue'>first type of tokenizer</font> that comes to mind is <font color='blue'>word-based</font>. It's generally very easy to set up and use with only a few rules, and it often yields decent results. For example, in the image below, the goal is to <font color='blue'>split</font> the <font color='blue'>raw text</font> into <font color='blue'>words</font> and find a numerical representation for each of them:

en_chapter2_word_based_tokenization.svg

There are different ways to split the text. For example, we could use <font color='blue'>whitespace</font> to <font color='blue'>tokenize the text</font> into words by applying Python's <font color='blue'>split()</font> function:

In [4]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


There are also <font color='blue'>variations</font> of word tokenizers that have <font color='blue'>extra rules for punctuation</font>. With this kind of tokenizer, we can end up with some pretty large “vocabularies,” where a <font color='blue'>vocabulary</font> is defined by the <font color='blue'>total number of independent tokens</font> that we have in our corpus.

Each <font color='blue'>word</font> gets <font color='blue'>assigned</font> an <font color='blue'>ID</font>, starting from 0 and going up to the size of the vocabulary minus one. The <font color='blue'>model</font> uses these <font color='blue'>IDs</font> to <font color='blue'>identify</font> each word.

If we want to completely cover a language with a word-based tokenizer, we'll need to have an <font color='blue'>identifier</font> for <font color='blue'>each word</font> in the <font color='blue'>language</font>, which will generate a <font color='blue'>huge number</font> of <font color='blue'>tokens</font>. For example, there are over <font color='blue'>500,000</font> words in the <font color='blue'>English</font> language, so to build a map from each word to an input ID we'd need to keep track of that many IDs. Furthermore, words like <font color='blue'>dog</font> are represented <font color='blue'>differently</font> from words like <font color='blue'>dogs</font>, and the model will initially have no way of knowing that “dog” and “dogs” are similar: it will identify the two words as unrelated. The same applies to other similar words, like “run” and “running”, which the model will not see as being similar initially.
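
As a minimal sketch of the mapping described above, the following builds a word-level vocabulary in plain Python. The three-sentence corpus is made up for illustration, and assigning IDs in order of first appearance is an assumption; a real tokenizer may order its vocabulary differently.

```python
# Toy corpus (hypothetical sentences, for illustration only).
corpus = [
    "Jim Henson was a puppeteer",
    "a dog chased Jim",
    "two dogs chased a puppeteer",
]

# Give every distinct whitespace-separated word the next free ID, starting from 0.
word_vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in word_vocab:
            word_vocab[word] = len(word_vocab)

print(word_vocab)
# "dog" and "dogs" receive unrelated IDs: nothing in the mapping links them.
print("dog ->", word_vocab["dog"], "| dogs ->", word_vocab["dogs"])
```

The IDs carry no notion of similarity, which is why the model has to learn from data that “dog” and “dogs” are related.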

Finally, we need a <font color='blue'>custom token</font> to represent <font color='blue'>words</font> that are <font color='blue'>not</font> in our <font color='blue'>vocabulary</font>. This is known as the <font color='blue'>unknown token</font>, often represented as <font color='blue'>[UNK]</font> or <font color='blue'>&lt;unk&gt;</font>. It's generally a bad sign if you see that the tokenizer is producing a lot of these tokens, as it wasn't able to retrieve a sensible representation of a word and you're losing information along the way. The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token.
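
The fallback to the unknown token can be sketched as a dictionary lookup with a default. The vocabulary and the helper `encode_words` below are hypothetical and exist only for this illustration; placing `[UNK]` at ID 0 is an arbitrary choice.

```python
UNK = "[UNK]"
# Hypothetical toy vocabulary; ID 0 is reserved for the unknown token here.
unk_vocab = {UNK: 0, "Jim": 1, "Henson": 2, "was": 3, "a": 4, "puppeteer": 5}

def encode_words(text, vocab):
    """Map each whitespace-separated word to its ID, falling back to the [UNK] ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

known_ids = encode_words("Jim Henson was a puppeteer", unk_vocab)
unknown_ids = encode_words("Jim Henson was a ventriloquist", unk_vocab)
print(known_ids)    # [1, 2, 3, 4, 5]
print(unknown_ids)  # [1, 2, 3, 4, 0]
```

Once “ventriloquist” has been mapped to the `[UNK]` ID, the original word cannot be recovered from the IDs, which is the information loss mentioned above.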

One way to <font color='blue'>reduce</font> the number of <font color='blue'>unknown tokens</font> is to go one level deeper, using a <font color='blue'>character-based</font> tokenizer.

### Character-based

Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:

- The <font color='blue'>vocabulary</font> is much <font color='blue'>smaller</font>.
- There are <font color='blue'>far fewer out-of-vocabulary</font> (unknown) <font color='blue'>tokens</font>, since *every word* can be built from characters.

But here too some questions arise concerning spaces and punctuation:

en_chapter2_character_based_tokenization.svg

This approach isn't perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it's <font color='blue'>less meaningful</font>: each <font color='blue'>character doesn't mean</font> a <font color='blue'>lot</font> on <font color='blue'>its own</font>, whereas that is the case with words. However, this again <font color='blue'>differs</font> according to the <font color='blue'>language</font>; in Chinese, for example, each character carries more information than a character in a language written in the Latin alphabet.

Another thing to consider is that we'll end up with a <font color='blue'>very large number</font> of <font color='blue'>tokens</font> to be <font color='blue'>processed</font> by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters.
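
To make the trade-off concrete, here is a small sketch that tokenizes the same sentence both ways in plain Python. Treating the space as an ordinary character and sorting the characters to assign IDs are assumptions of this sketch, not rules of any particular tokenizer.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()  # word-based: split on whitespace
char_tokens = list(text)    # character-based: one token per character, spaces included

# The character vocabulary only needs one entry per distinct character.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}

print("word tokens:", len(word_tokens))
print("character tokens:", len(char_tokens))
print("character vocabulary size:", len(char_vocab))
print("'puppeteer' as characters:", list("puppeteer"))
```

The vocabulary stays tiny, but the sequence the model has to process becomes several times longer.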

To get the best of both worlds, we can use a third technique that combines the two approaches: *subword tokenization*.

### Subword tokenization

Subword tokenization algorithms rely on the principle that <font color='blue'>frequently used words</font> should <font color='blue'>not</font> be <font color='blue'>split</font> into <font color='blue'>smaller subwords</font>, but <font color='blue'>rare words</font> should be <font color='blue'>decomposed</font> into meaningful <font color='blue'>subwords</font>.

For instance, <font color='blue'>annoyingly</font> might be considered a rare word and could be <font color='blue'>decomposed</font> into <font color='blue'>annoying</font> and <font color='blue'>ly</font>. These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.

Here is an example showing how a subword tokenization algorithm would tokenize the sequence <font color='blue'>Let's do tokenization!</font>:

en_chapter2_bpe_subword.svg

These <font color='blue'>subwords</font> end up providing a lot of <font color='blue'>semantic meaning</font>: for instance, in the example above “tokenization” was split into “token” and “ization”, <font color='blue'>two tokens</font> that have a <font color='blue'>semantic meaning</font> while being <font color='blue'>space-efficient</font> (only two tokens are needed to represent a long word). This allows us to have relatively <font color='blue'>good coverage</font> with <font color='blue'>small vocabularies</font>, and close to no unknown tokens.
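
One way to picture how a word gets split is greedy longest-match-first lookup against a subword vocabulary, which is the matching strategy WordPiece-style tokenizers apply at tokenization time. The sketch below is a simplification under stated assumptions: the vocabulary is a hand-written toy, the `##` prefix for word-internal pieces follows the BERT convention, and the helper `subword_tokenize` is hypothetical. It does not show how a vocabulary is learned, and it is not necessarily the algorithm behind the figure above.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hand-written toy vocabulary (an assumption for this example).
toy_subword_vocab = {"token", "##ization", "annoying", "##ly", "do"}

print(subword_tokenize("tokenization", toy_subword_vocab))  # ['token', '##ization']
print(subword_tokenize("annoyingly", toy_subword_vocab))    # ['annoying', '##ly']
print(subword_tokenize("do", toy_subword_vocab))            # ['do']
print(subword_tokenize("xyz", toy_subword_vocab))           # ['[UNK]']
```

Frequent words such as “do” stay whole because they are in the vocabulary, while rarer words are covered by combining a few shared pieces.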

This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.

**And more!**

Unsurprisingly, there are many more techniques out there. To name a few:

- <font color='blue'>Byte-level BPE</font>, as used in <font color='blue'>GPT-2</font>
- <font color='blue'>WordPiece</font>, as used in <font color='blue'>BERT</font>
- <font color='blue'>SentencePiece</font> or Unigram, as used in several <font color='blue'>multilingual models</font>

You should now have sufficient knowledge of how tokenizers work to get started with the API.

### Loading and saving

Loading and saving tokenizers is as simple as it is with models. Actually, it's <font color='blue'>based</font> on the same <font color='blue'>two methods</font>: <font color='blue'>from_pretrained()</font> and <font color='blue'>save_pretrained()</font>. These methods will load or save the algorithm used by the tokenizer (a bit like the *architecture* of the model) as well as its vocabulary (a bit like the *weights* of the model).

Loading the <font color='blue'>BERT tokenizer</font> trained with the <font color='blue'>same checkpoint</font> as <font color='blue'>BERT</font> is done the same way as loading the model, except we use the `BertTokenizer` class:

In [16]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to AutoModel, the <font color='blue'>AutoTokenizer class</font> will grab the <font color='blue'>proper tokenizer class</font> in the library based on the <font color='blue'>checkpoint name</font>, and can be used directly with any checkpoint:

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

We can now use the tokenizer as shown in the previous section:

In [23]:
tokenized_output = tokenizer("Using a Transformer network is simple")
print("input_ids:", tokenized_output['input_ids'])
print("token_type_ids:", tokenized_output['token_type_ids'])
print("attention_mask:", tokenized_output['attention_mask'])

input_ids: [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]
token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]


Saving a tokenizer is identical to saving a model:

In [8]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

We'll talk more about <font color='blue'>token_type_ids</font> in [Chapter 3](https://huggingface.co/course/chapter3), and we'll explain the <font color='blue'>attention_mask</font> key a little later. First, let's see how the <font color='blue'>input_ids</font> are <font color='blue'>generated</font>. To do this, we'll need to look at the intermediate methods of the tokenizer.

### Encoding

<font color='blue'>Translating text</font> to <font color='blue'>numbers</font> is known as <font color='blue'>encoding</font>. Encoding is done in a <font color='blue'>two-step process</font>: the <font color='blue'>tokenization</font>, followed by the <font color='blue'>conversion to input IDs</font>.

As we've seen, the <font color='blue'>first step</font> is to <font color='blue'>split the text into words</font> (or parts of words, punctuation symbols, etc.), usually called *tokens*. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

The <font color='blue'>second step</font> is to <font color='blue'>convert</font> those <font color='blue'>tokens</font> into <font color='blue'>numbers</font>, so we can build a tensor out of them and feed them to the model. To do this, the <font color='blue'>tokenizer</font> has a <font color='blue'>vocabulary</font>, which is the part we <font color='blue'>download</font> when we <font color='blue'>instantiate</font> it with the <font color='blue'>from_pretrained() method</font>. Again, we need to use the same vocabulary used when the model was pretrained.

To get a better understanding of the two steps, we'll explore them separately. Note that we will use some methods that perform parts of the tokenization pipeline separately to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs (as shown in section 2).


### Tokenization

The tokenization process is done by the <font color='blue'>tokenize() method</font> of the tokenizer:

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

The output of this method is a <font color='blue'>list of strings</font>, or tokens:

In [10]:
print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


This tokenizer is a <font color='blue'>subword tokenizer</font>: it <font color='blue'>splits</font> the <font color='blue'>words</font> until it obtains <font color='blue'>tokens</font> that can be <font color='blue'>represented</font> by its <font color='blue'>vocabulary</font>. That's the case here with `Transformer`, which is split into two tokens: `Trans` and `##former`.

### From tokens to input IDs

The <font color='blue'>conversion to input IDs</font> is handled by the `convert_tokens_to_ids()` tokenizer method:

In [11]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model as seen earlier in this chapter.
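
Conceptually, this conversion is a dictionary lookup in the tokenizer's vocabulary. The sketch below illustrates the idea with a toy mapping that contains only the seven token/ID pairs printed above; it is an illustration of the concept, not the library's implementation, and the names `toy_vocab`, `toy_tokens`, and `toy_ids` are made up for this example.

```python
# Toy vocabulary: only the token/ID pairs printed above, not the full bert-base-cased vocabulary.
toy_vocab = {
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}

toy_tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]

# Token -> ID: look each token up in the mapping.
toy_ids = [toy_vocab[token] for token in toy_tokens]
print(toy_ids)

# ID -> token: invert the mapping to go the other way.
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}
print([id_to_token[token_id] for token_id in toy_ids])
```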

**Try it out!** Replicate the <font color='blue'>last two steps</font> (tokenization and conversion to input IDs) on the <font color='blue'>input sentences</font> we used in <font color='blue'>section 2</font> (“I've been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Check that you get the same input IDs we got earlier!

In [12]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence1 = "I've been waiting for a HuggingFace course my whole life."
sequence2 = "I hate this so much!"
tokens1 = tokenizer.tokenize(sequence1)
tokens2 = tokenizer.tokenize(sequence2)
print(tokens1)
print(tokens2)

['I', "'", 've', 'been', 'waiting', 'for', 'a', 'Hu', '##gging', '##F', '##ace', 'course', 'my', 'whole', 'life', '.']
['I', 'hate', 'this', 'so', 'much', '!']


In [13]:
ids1 = tokenizer.convert_tokens_to_ids(tokens1)
ids2 = tokenizer.convert_tokens_to_ids(tokens2)
print(ids1)
print(ids2)

[146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119]
[146, 4819, 1142, 1177, 1277, 106]


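Calling the tokenizer directly on the same sentences, as we did in [Behind the pipeline](https://huggingface.co/learn/nlp-course/chapter2/2), yields these same IDs, with the special tokens `[CLS]` and `[SEP]` added and the shorter sequence padded: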

In [26]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

# Input sentences
print("Input Sentences:")
print(raw_inputs[0])
print(raw_inputs[1])
print()

# Input IDs
print("Input IDs:")
print(inputs['input_ids'])

# Show a conversion of IDs to tokens
tokens = [tokenizer.convert_ids_to_tokens(ids) for ids in inputs['input_ids']]
print("\nTokens:")
for i, token_list in enumerate(tokens):
    print(f"Input {i + 1}: {token_list}")

# Attention mask
print("\nAttention Mask:")
print(inputs['attention_mask'])

Input Sentences:
I've been waiting for a HuggingFace course my whole life.
I hate this so much!

Input IDs:
tensor([[  101,   146,   112,  1396,  1151,  2613,  1111,   170, 20164, 10932,
          2271,  7954,  1736,  1139,  2006,  1297,   119,   102],
        [  101,   146,  4819,  1142,  1177,  1277,   106,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0]])

Tokens:
Input 1: ['[CLS]', 'I', "'", 've', 'been', 'waiting', 'for', 'a', 'Hu', '##gging', '##F', '##ace', 'course', 'my', 'whole', 'life', '.', '[SEP]']
Input 2: ['[CLS]', 'I', 'hate', 'this', 'so', 'much', '!', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

Attention Mask:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
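
The difference between the IDs from `convert_tokens_to_ids()` and the tensors above can be reproduced by hand. The following is a rough sketch, assuming the special-token IDs read off the output above for this checkpoint (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`); the helper `pad_batch` is hypothetical, and in practice you should let the tokenizer do this rather than hard-code IDs.

```python
# Special-token IDs as they appear in the output above (assumed specific to this checkpoint).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_batch(batch_ids):
    """Add [CLS]/[SEP], pad every sequence to the longest one, and build the attention mask."""
    with_special = [[CLS_ID] + ids + [SEP_ID] for ids in batch_ids]
    max_len = max(len(ids) for ids in with_special)
    input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in with_special]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in with_special]
    return input_ids, attention_mask

# The IDs obtained earlier from convert_tokens_to_ids().
manual_ids1 = [146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119]
manual_ids2 = [146, 4819, 1142, 1177, 1277, 106]

padded_ids, padded_mask = pad_batch([manual_ids1, manual_ids2])
for row in padded_ids:
    print(row)
for row in padded_mask:
    print(row)
```

Both rows end up 18 entries long, matching the shape of the tensors printed above.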


### Decoding

*Decoding* is going the other way around: from <font color='blue'>vocabulary indices</font>, we want to get a <font color='blue'>string</font>. This can be done with the `decode()` method as follows:

In [15]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple


Note that the `decode()` method not only converts the indices back to tokens, but also <font color='blue'>groups together</font> the <font color='blue'>tokens</font> that were <font color='blue'>part</font> of the <font color='blue'>same words</font> to <font color='blue'>produce</font> a <font color='blue'>readable sentence</font>. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
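
The grouping step can be sketched for WordPiece-style tokens, where a `##` prefix marks a piece that continues the previous word. This is a simplified illustration with a hypothetical helper, `merge_wordpieces`; the real `decode()` method handles more than this, such as special tokens and spacing around punctuation.

```python
def merge_wordpieces(tokens):
    """Glue '##' continuation pieces onto the preceding token, then join words with spaces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and extend the current word
        else:
            words.append(token)
    return " ".join(words)

merged = merge_wordpieces(["Using", "a", "Trans", "##former", "network", "is", "simple"])
print(merged)  # Using a Transformer network is simple
```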

By now you should understand the atomic operations a tokenizer can handle: tokenization, conversion to IDs, and converting IDs back to a string. However, we've only scratched the surface. In the following section, we'll take our approach to its limits and take a look at how to overcome them.