# Tokenizers

In NLP, tokenizers translate text into data that the model can process. Tokenization can operate at the word, character, or subword level; BPE and SentencePiece are examples of subword approaches. Each pre-trained model is coupled with the tokenizer that was used to process its training data. As we did with models, tokenizers are loaded with the name of the checkpoint:

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

More generally, we can use the [AutoTokenizer class](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#autotokenizer), which instantiates the appropriate tokenizer for the given checkpoint.

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Tokenizers perform two steps: the actual tokenization, which goes from text to text and splits the input into tokens so that the vocabulary stays small, and the encoding, which converts tokens into a numerical representation. Tokenization is carried out by the tokenize() method of the [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer):

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
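
To make the split of "Transformer" into 'Trans' and '##former' less mysterious, here is a minimal sketch of greedy longest-match-first subword splitting, the idea behind WordPiece. It is not the library's implementation: the function name `wordpiece` and the tiny `toy_vocab` are assumptions made for illustration, and a real tokenizer also normalizes the text and splits punctuation before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not the real BERT vocabulary).
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

sentence = "Using a Transformer network is simple"
tokens = [p for w in sentence.split() for p in wordpiece(w, toy_vocab)]
print(tokens)
# ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
```

With this toy vocabulary, a word that cannot be assembled from known pieces is mapped to a single unknown token.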


Encoding: the conversion from tokens to input IDs is handled by the convert_tokens_to_ids() method of the [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer):

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


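Decoding: from vocabulary indices back to a string, including detokenization (subword pieces are merged back into words). Note that the IDs passed below differ in two positions from those printed above, so the decoded string reads "transformer" in lowercase: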

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple
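
Conceptually, encoding is a dictionary lookup and decoding is the reverse lookup followed by gluing the '##' pieces back together. The sketch below illustrates this with the token and ID pairs that appear in the outputs above; the helper names `encode` and `decode` are hypothetical, and the real decode() method also deals with special tokens and spacing clean-up.

```python
# Token-to-ID pairs copied from the outputs shown above; a real vocabulary
# holds tens of thousands of entries.
vocab = {"Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
         "network": 2443, "is": 1110, "simple": 3014}
id_to_token = {i: t for t, i in vocab.items()}


def encode(tokens):
    """Map each token to its vocabulary index."""
    return [vocab[t] for t in tokens]


def decode(ids):
    """Map indices back to tokens and merge '##' pieces into words."""
    text = ""
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##"):
            text += token[2:]  # glue continuation piece to the previous token
        else:
            text += (" " if text else "") + token
    return text


ids = encode(["Using", "a", "Trans", "##former", "network", "is", "simple"])
print(ids)          # [7993, 170, 13809, 23763, 2443, 1110, 3014]
print(decode(ids))  # Using a Transformer network is simple
```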


**Exercise**: In Colab, find pre-trained SentencePiece and Encoder models and apply them to the sample sentence "mT5: A massively multilingual pre-trained text-to-text transformer". [Solution](https://colab.research.google.com/github/jorcisai/ARF/blob/master/HuggingFace/05-Tokenizers-Solution.ipynb)

The following code preprocesses a sequence step by step using the methods of the tokenizer.

In [18]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
print(input_ids)

tensor([ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
         2026,  2878,  2166,  1012])


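Compare the output of the previous code with the output of the following code, which calls the tokenizer directly. The direct call also adds the special tokens and returns the IDs as a batch, that is, a 2D tensor: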

In [None]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])
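
The difference between the two outputs can be sketched in plain Python: the direct call is expected to wrap the same IDs in the special tokens and in an extra batch dimension. The values 101 and 102 are assumed here to be the IDs of [CLS] and [SEP] for this checkpoint (they match the outputs shown later in this section); with a loaded tokenizer they can be read from `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.

```python
# IDs printed by the step-by-step code (tokenize + convert_tokens_to_ids).
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607,
       2026, 2878, 2166, 1012]

# Assumed special-token IDs for this vocabulary: [CLS] = 101, [SEP] = 102.
CLS_ID, SEP_ID = 101, 102

with_special = [CLS_ID] + ids + [SEP_ID]  # what the model actually expects
batch = [with_special]                    # batch of one sequence (2D)

print(len(ids), len(with_special))  # 14 16
print(batch)
```

Models expect a batch as input, which is why passing the 1D tensor from the step-by-step code to the model would not work as is.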

## Dealing with multiple sequences

Multiple sentences of different lengths can be tokenized at the same time:

In [13]:
seqs = ["short sentence", "sentence with more than two words"]
tokenized_inputs = tokenizer(seqs)
print(tokenized_inputs['input_ids'])

[[101, 2460, 6251, 102], [101, 6251, 2007, 2062, 2084, 2048, 2616, 102]]


However, they cannot be converted into a tensor because they have different lengths, so the following call raises an error:

In [None]:
tokenized_inputs = tokenizer(seqs, return_tensors="pt")

So, [padding and truncation](https://huggingface.co/docs/transformers/pad_truncation) are usually necessary. First, let us see padding:

In [15]:
tokenized_inputs = tokenizer(seqs, return_tensors="pt", padding=True)
print(tokenized_inputs)

{'input_ids': tensor([[ 101, 2460, 6251,  102,    0,    0,    0,    0],
        [ 101, 6251, 2007, 2062, 2084, 2048, 2616,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}
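
What padding=True does can be sketched in plain Python: every sequence is padded on the right up to the length of the longest one, and the attention mask marks real tokens with 1 and padding with 0. The function name `pad_batch` is hypothetical and the pad ID 0 is taken from the output above; this is an illustration, not the library's code.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Unpadded IDs of the two example sentences, as printed earlier.
batch = [[101, 2460, 6251, 102],
         [101, 6251, 2007, 2062, 2084, 2048, 2616, 102]]

input_ids, attention_mask = pad_batch(batch)
print(input_ids)
# [[101, 2460, 6251, 102, 0, 0, 0, 0], [101, 6251, 2007, 2062, 2084, 2048, 2616, 102]]
print(attention_mask)
# [[1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
```

The attention mask matters because it tells the model to ignore the padded positions.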


You can take a look at the first sentence after being tokenized and padded:

In [16]:
print(tokenizer.decode(tokenized_inputs['input_ids'][0]))

[CLS] short sentence [SEP] [PAD] [PAD] [PAD] [PAD]


Transformer models are trained on sequences up to a maximum length, and their performance degrades rapidly if inference uses longer sequences; some models cannot process longer inputs at all. A straightforward solution is truncation:

In [17]:
tokenized_inputs = tokenizer(seqs, return_tensors="pt", truncation=True, max_length=4)
print(tokenized_inputs)

{'input_ids': tensor([[ 101, 2460, 6251,  102],
        [ 101, 6251, 2007,  102]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1]])}
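
The output shows that max_length counts the special tokens too: with max_length=4 only two content tokens survive, and [SEP] is kept at the end. A minimal plain-Python sketch of this behavior follows. The function name `truncate` is hypothetical, the IDs 101 and 102 are the [CLS] and [SEP] IDs seen in the outputs, and the sketch assumes each input already starts with [CLS] and ends with [SEP].

```python
def truncate(ids, max_length, cls_id=101, sep_id=102):
    """Keep [CLS] and [SEP]; drop content tokens from the right if too long."""
    content = ids[1:-1]               # strip the special tokens
    budget = max(max_length - 2, 0)   # room left for content tokens
    return [cls_id] + content[:budget] + [sep_id]


batch = [[101, 2460, 6251, 102],
         [101, 6251, 2007, 2062, 2084, 2048, 2616, 102]]

truncated = [truncate(seq, max_length=4) for seq in batch]
print(truncated)
# [[101, 2460, 6251, 102], [101, 6251, 2007, 102]]
```

After truncation both sequences have the same length here, which is why no padding was needed to build the tensor in this particular example.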


For further basic tokenizer options, you can check out the following [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter2/section6_pt.ipynb).