# Tokenizers

In NLP, tokenizers translate text into data ready to be processed by the model.
The Transformers library contains tokenizers for all of its models. Most of the tokenizers are available in two types: a full Python implementation and a “Fast” implementation based on the Rust library Tokenizers. The “Fast” implementations allow:

* a significant speed-up in particular when doing batched tokenization
* additional methods to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).

The base classes [PreTrainedTokenizer](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) and [PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) implement the common methods for encoding string inputs into model inputs (see below) and for instantiating/saving Python and “Fast” tokenizers, either from a local file or directory or from a pretrained tokenizer provided by the library.

[PreTrainedTokenizer](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) and [PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) implement the main methods for using all the tokenizers:

* Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).
* Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…).
* Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization.
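
The last two points can be illustrated without the library. The following is only a minimal sketch in plain Python, not the way Transformers implements them: the helper names `add_tokens` and `split_protecting_special` and the four-entry vocabulary are hypothetical, and the sketch assumes that ids are contiguous so that `len(vocab)` is the next free id.

```python
import re

def add_tokens(vocab, new_tokens):
    """Give each unseen token the next free id; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # assumes ids are contiguous from 0
            added += 1
    return added

def split_protecting_special(text, special_tokens):
    """Split text into words and punctuation, keeping special tokens whole."""
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    pieces = []
    for chunk in re.split(pattern, text):
        if chunk in special_tokens:
            pieces.append(chunk)  # emitted as a single, unsplit token
        else:
            pieces.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return pieces

# Hypothetical toy vocabulary
vocab = {"[PAD]": 0, "[UNK]": 1, "[MASK]": 2, "Paris": 3}
num_added = add_tokens(vocab, ["Paris", "France"])  # "Paris" already exists

text = "Paris is the [MASK] of France"
naive = re.findall(r"\w+|[^\w\s]", text)            # splits "[MASK]" into 3 pieces
protected = split_protecting_special(text, ["[MASK]"])

print(num_added, vocab["France"])
print(naive)
print(protected)
```

Without the protection step the brackets are treated as ordinary punctuation and `[MASK]` is broken apart, which is what the special-token handling is meant to prevent.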

The [BatchEncoding](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.BatchEncoding) class stores the output of the encoding methods of PreTrainedTokenizer and PreTrainedTokenizerFast and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and stores the various model inputs computed by these methods (input_ids, attention_mask, etc.). When the tokenizer is a Fast tokenizer, this class additionally provides several **advanced alignment methods** that map between the original string (characters and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token). This functionality of Fast tokenizers is very convenient when the output of the model has to be mapped back onto the original string.
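
To make the idea of alignment concrete, here is a minimal sketch in plain Python. It is not the Fast tokenizer implementation: it uses a simple whitespace split, and the helper names `tokenize_with_offsets`, `token_index_for_char` and `char_span_for_token` are hypothetical. The principle is the same, though: each token remembers the character span it came from, so lookups work in both directions.

```python
import re

def tokenize_with_offsets(text):
    """Split on whitespace and record each token's (start, end) character span."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]

def token_index_for_char(offsets, char_index):
    """Return the index of the token covering char_index (None for whitespace)."""
    for i, (_, (start, end)) in enumerate(offsets):
        if start <= char_index < end:
            return i
    return None

def char_span_for_token(offsets, token_index):
    """Return the (start, end) character span of the given token."""
    return offsets[token_index][1]

text = "Using a Transformer network is simple"
offsets = tokenize_with_offsets(text)

token_index = token_index_for_char(offsets, 10)   # character 10 lies inside "Transformer"
start, end = char_span_for_token(offsets, token_index)
print(token_index, (start, end), text[start:end])
```

With a Fast tokenizer, the corresponding lookups are exposed as methods of `BatchEncoding`, such as `char_to_token()` and `token_to_chars()`.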

Tokenizers differ in granularity (word, character or subword level) and in the algorithm they use (BPE, WordPiece, SentencePiece, etc.). Each pre-trained model is coupled with the tokenizer that was used to process its training data.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In this notebook, we are using the [BertTokenizer](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertTokenizer), a [WordPiece tokenizer](https://huggingface.co/course/chapter6/6) that inherits from the [PreTrainedTokenizer](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) class. As we did with models, tokenizers are loaded with the name of the checkpoint.



In [2]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

However, the [AutoTokenizer class](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#autotokenizer) loads a Fast tokenizer by default, which inherits from [PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) and is instantiated from the checkpoint.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Tokenizers perform two steps: the tokenization proper, which goes from text to text and splits the input into tokens drawn from a limited vocabulary, and the encoding, which converts tokens into a numerical representation. The first step is carried out by the [tokenize()](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.tokenize) method of the [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer):

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
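
The split of `Transformer` into `Trans` and `##former` follows from how WordPiece tokenizes a word at inference time: it repeatedly takes the longest prefix found in the vocabulary and marks every non-initial piece with `##`. The sketch below illustrates this greedy longest-match-first rule in plain Python. It is not the library's code, the function name `wordpiece_tokenize` is hypothetical, and the toy vocabulary contains only the pieces printed above rather than the real `bert-base-cased` vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (assumption): only the pieces seen in the output above
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

sequence = "Using a Transformer network is simple"
tokens = [p for word in sequence.split() for p in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
# ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
```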


Encoding: the conversion from tokens to input IDs is handled by the [convert_tokens_to_ids()](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.convert_tokens_to_ids) method of the [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer):

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
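
Conceptually, this step is a dictionary lookup in the vocabulary. The following sketch rebuilds the mapping from the tokens and ids printed above. The helper name `tokens_to_ids` is hypothetical, and the id `100` used for `[UNK]` is an assumption made for illustration only.

```python
# Token -> id pairs taken from the two outputs above
toy_vocab = {
    "[UNK]": 100,  # assumed id for the unknown token
    "Using": 7993,
    "a": 170,
    "Trans": 13809,
    "##former": 23763,
    "network": 2443,
    "is": 1110,
    "simple": 3014,
}

def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token, falling back to the unknown-token id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]

tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
ids = tokens_to_ids(tokens, toy_vocab)
print(ids)
# [7993, 170, 13809, 23763, 2443, 1110, 3014]
```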


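Decoding: the [decode()](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode) method converts vocabulary indices back into a string, including detokenization. The ids below are hard-coded and differ from those printed above in the third and fourth positions, which is why the decoded text reads `transformer` in lower case: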

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple
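
For WordPiece, detokenization essentially means gluing `##` continuation pieces back onto the preceding piece. The sketch below shows that idea on the ids printed in the encoding step. It is a simplification under stated assumptions: the helper name `ids_to_text` is hypothetical, the id-to-token table is rebuilt from the outputs above, and the library's `decode()` does more than this (for instance, it cleans up spacing around punctuation).

```python
# id -> token pairs taken from the tokenization and encoding outputs above
id_to_token = {
    7993: "Using",
    170: "a",
    13809: "Trans",
    23763: "##former",
    2443: "network",
    1110: "is",
    3014: "simple",
}

def ids_to_text(ids, id_to_token):
    """Map ids to tokens and merge ## continuation pieces into whole words."""
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the piece onto the previous word
        else:
            words.append(token)
    return " ".join(words)

text = ids_to_text([7993, 170, 13809, 23763, 2443, 1110, 3014], id_to_token)
print(text)
# Using a Transformer network is simple
```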


**Exercise**: Use the pre-trained SentencePiece tokenizer and the encoder from the [T5 model](https://huggingface.co/docs/transformers/model_doc/t5) and apply them to the sample sentence “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. [Solution](https://colab.research.google.com/github/jorcisai/ARF/blob/master/HuggingFace/05-Tokenizers-Solution.ipynb)

The following example applies the individual tokenizer methods step by step, so that the result can be compared with calling the tokenizer end to end:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "This notebook provides an overview of the functions in the tokenizer class."
tokens = tokenizer.tokenize(sequence)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
print(input_ids)

['this', 'notebook', 'provides', 'an', 'overview', 'of', 'the', 'functions', 'in', 'the', 'token', '##izer', 'class', '.']
tensor([ 2023, 14960,  3640,  2019, 19184,  1997,  1996,  4972,  1999,  1996,
        19204, 17629,  2465,  1012])


Compare the output of the previous code with that of the following code, which calls the tokenizer directly:

In [None]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  2023, 14960,  3640,  2019, 19184,  1997,  1996,  4972,  1999,
          1996, 19204, 17629,  2465,  1012,   102]])


As you can see, calling the tokenizer directly prepares the sequence in the form that the model expects, including the special tokens (classification `[CLS]` and separator `[SEP]`):

In [None]:
tokenizer.decode(input_ids)

'this notebook provides an overview of the functions in the tokenizer class.'

In [None]:
tokenizer.decode(tokenized_inputs["input_ids"][0])

'[CLS] this notebook provides an overview of the functions in the tokenizer class. [SEP]'
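
The difference between the two results can be reproduced by hand. This is a minimal sketch, not the library's implementation: the helper name `add_special_tokens` is hypothetical, and the ids `101` and `102` are read off the outputs above, where they surround the sequence and decode to `[CLS]` and `[SEP]`.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] as seen in the outputs above

def add_special_tokens(ids):
    """Wrap a single sequence of ids as [CLS] ... [SEP]."""
    return [CLS_ID] + list(ids) + [SEP_ID]

# ids produced by tokenize() + convert_tokens_to_ids() above
ids = [2023, 14960, 3640, 2019, 19184, 1997, 1996,
       4972, 1999, 1996, 19204, 17629, 2465, 1012]

model_input = add_special_tokens(ids)
print(model_input)
# [101, 2023, 14960, 3640, 2019, 19184, 1997, 1996, 4972, 1999, 1996, 19204, 17629, 2465, 1012, 102]
```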

## Dealing with multiple sequences

Multiple sentences of different lengths can be tokenized at the same time:

In [None]:
seqs = ["short sentence", "sentence with more than two words"]
tokenized_inputs = tokenizer(seqs)
print(tokenized_inputs['input_ids'])

[[101, 2460, 6251, 102], [101, 6251, 2007, 2062, 2084, 2048, 2616, 102]]


However, they cannot be converted into tensors because the sequences differ in length, so the following call raises an error:

In [None]:
tokenized_inputs = tokenizer(seqs, return_tensors="pt")

So, [padding and truncation](https://huggingface.co/docs/transformers/pad_truncation) are usually necessary. First, let us see padding:

In [None]:
tokenized_inputs = tokenizer(seqs, return_tensors="pt", padding=True)
print(tokenized_inputs)

{'input_ids': tensor([[ 101, 2460, 6251,  102,    0,    0,    0,    0],
        [ 101, 6251, 2007, 2062, 2084, 2048, 2616,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}


You can take a look at the first sentence after being tokenized and padded:

In [None]:
print(tokenizer.decode(tokenized_inputs['input_ids'][0]))

[CLS] short sentence [SEP] [PAD] [PAD] [PAD] [PAD]
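
What padding does to a batch can be sketched in a few lines of plain Python. This is an illustration rather than the library's code: the helper name `pad_batch` is hypothetical, and the pad id `0` is taken from the output above, where the zeros decode to `[PAD]`.

```python
PAD_ID = 0  # id that decodes to [PAD] in the output above

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded ids of the two sentences, as printed earlier
batch = [[101, 2460, 6251, 102],
         [101, 6251, 2007, 2062, 2084, 2048, 2616, 102]]

padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

After padding, all rows have the same length, so the batch is rectangular and can be turned into a tensor.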


Transformer models are trained on sequences up to a maximum length, and their performance degrades rapidly if inference uses longer sequences. A straightforward solution is truncation:

In [None]:
tokenized_inputs = tokenizer(seqs, return_tensors="pt", truncation=True, max_length=4)
print(tokenized_inputs)

{'input_ids': tensor([[ 101, 2460, 6251,  102],
        [ 101, 6251, 2007,  102]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1]])}
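
Note that `max_length` counts the special tokens as well, so with `max_length=4` only two content tokens survive. The sketch below reproduces the result above on already-encoded sequences. It is a simplification under stated assumptions: the helper name `truncate` is hypothetical, every input is assumed to start with `[CLS]` and end with `[SEP]`, and `max_length` is assumed to be at least 2. The library may order these steps differently internally.

```python
SEP_ID = 102  # id of [SEP] as seen in the outputs above

def truncate(seq, max_length, sep_id=SEP_ID):
    """Cut an encoded [CLS] ... [SEP] sequence to max_length, keeping the final [SEP]."""
    if len(seq) <= max_length:
        return list(seq)
    # keep [CLS] plus the first content tokens, then close with [SEP]
    return list(seq[: max_length - 1]) + [sep_id]

batch = [[101, 2460, 6251, 102],
         [101, 6251, 2007, 2062, 2084, 2048, 2616, 102]]

truncated = [truncate(seq, max_length=4) for seq in batch]
print(truncated)
# [[101, 2460, 6251, 102], [101, 6251, 2007, 102]]
```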


You can check out the following [notebook for further basic options on the tokenizer](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter2/section6_pt.ipynb).