# Tokenizers

In NLP, tokenizers translate text into data that the model can process. Approaches differ in granularity (word, character, or subword level) and in the algorithms and tools used, such as BPE and SentencePiece. Each pre-trained model is coupled with the tokenizer that was used to process its training data. As with models, tokenizers are loaded by checkpoint name.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

More generally, we can use the [AutoTokenizer class](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#autotokenizer), which is instantiated from the checkpoint name and selects the matching tokenizer.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Tokenizers perform two steps: tokenization proper, which maps text to text by splitting it into tokens drawn from a limited vocabulary, and encoding, which converts those tokens into a numerical representation. Tokenization is carried out by the tokenize() method of the [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer):

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
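To make the subword split above more concrete, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece. It is only an illustration: `TOY_VOCAB`, `wordpiece` and `toy_tokenize` are hypothetical names, the vocabulary is a tiny hand-made set rather than the real bert-base-cased vocabulary, and the library's actual implementation handles many more cases (punctuation, normalization, maximum word length).

```python
# Toy WordPiece-style tokenizer: greedy longest-match-first over a tiny,
# hypothetical vocabulary (NOT the real bert-base-cased vocabulary).
TOY_VOCAB = {"Using", "a", "Trans", "##former", "network", "is", "simple", "[UNK]"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text, vocab=TOY_VOCAB):
    # Whitespace pre-tokenization, then subword splitting of each word.
    return [p for word in text.split() for p in wordpiece(word, vocab)]


toy_tokens = toy_tokenize("Using a Transformer network is simple")
print(toy_tokens)
```

Because "Transformer" is not in the toy vocabulary but "Trans" and "##former" are, the word is split into those two pieces, while a word with no matching pieces falls back to the unknown token.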


Encoding: the conversion from tokens to input IDs is handled by the convert_tokens_to_ids() method of the [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer):

In [5]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
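Conceptually, this step is a vocabulary lookup. The sketch below reproduces it with a partial table whose IDs are copied from the outputs shown in this notebook. It also shows why calling the tokenizer directly returned two extra IDs: 101 and 102 are taken here to be the [CLS] and [SEP] special tokens that BERT adds around the sequence. The names `PARTIAL_VOCAB`, `to_ids` and `add_special_tokens` are hypothetical helpers, not part of the transformers API.

```python
# Partial token-to-ID table; the IDs are copied from the outputs shown in this
# notebook for bert-base-cased and cover only the sample sentence.
PARTIAL_VOCAB = {
    "[CLS]": 101,  # assumed: start-of-sequence special token
    "[SEP]": 102,  # assumed: end-of-sequence special token
    "Using": 7993,
    "a": 170,
    "Trans": 13809,
    "##former": 23763,
    "network": 2443,
    "is": 1110,
    "simple": 3014,
}


def to_ids(tokens, vocab):
    # Encoding is a plain dictionary lookup, one ID per token.
    return [vocab[token] for token in tokens]


def add_special_tokens(ids, vocab):
    # Wrap the sequence the way a single-sentence BERT input is wrapped.
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


sketch_tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
sketch_ids = to_ids(sketch_tokens, PARTIAL_VOCAB)
model_input_ids = add_special_tokens(sketch_ids, PARTIAL_VOCAB)

print(sketch_ids)
print(model_input_ids)
```

The first list matches the output of convert_tokens_to_ids() above, and the second matches the input_ids returned when the tokenizer was called on the raw sentence.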


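Decoding: from vocabulary indices back to a string, including detokenization (subword pieces are merged back into words). The IDs below differ from the ones above (11303 and 1200 replace 13809 and 23763), so the decoded text reads "transformer" rather than "Transformer":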

In [6]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple
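The detokenization part of decoding can be sketched as an inverse lookup followed by merging every "##" continuation piece into the preceding word. This is a simplified illustration with hypothetical names (`ID_TO_TOKEN`, `toy_decode`) and the IDs printed in the encoding step; the real decode() method also deals with special tokens and spacing clean-up.

```python
# Inverse table for the IDs printed in the encoding step of this notebook.
ID_TO_TOKEN = {
    7993: "Using",
    170: "a",
    13809: "Trans",
    23763: "##former",
    2443: "network",
    1110: "is",
    3014: "simple",
}


def toy_decode(ids, id_to_token):
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word without a space.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


toy_decoded = toy_decode([7993, 170, 13809, 23763, 2443, 1110, 3014], ID_TO_TOKEN)
print(toy_decoded)
```

With these IDs, the pieces "Trans" and "##former" are merged back into the single word "Transformer".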


**Exercise**: In Colab, find pre-trained SentencePiece and Encoder models and apply them to the sample sentence "mT5: A massively multilingual pre-trained text-to-text transformer". [Solution](https://colab.research.google.com/github/jorcisai/ARF/blob/master/HuggingFace/05-Tokenizers-Solution.ipynb)