https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt

# Loading and saving

Loading and saving tokenizers is as simple as it is with models. Actually, it’s based on the same two methods: `from_pretrained()` and `save_pretrained()`. These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).

In [1]:
# Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the BertTokenizer class:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [2]:
tokenizer

BertTokenizer(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [3]:
# Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based on the checkpoint name,
# and can be used directly with any checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [4]:
# We can now use the tokenizer as shown in the previous section:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [5]:
inputs = tokenizer("As armas e os barões assinalados, na ocidental praia lusitana")

print(inputs['input_ids'])

[101, 1249, 1981, 2225, 174, 184, 1116, 2927, 28207, 1279, 3919, 14196, 9359, 1116, 117, 9468, 184, 16388, 22692, 185, 14089, 1161, 181, 1361, 5168, 1605, 102]


In [None]:
# Saving a tokenizer is identical to saving a model:
# tokenizer.save_pretrained("directory_on_my_computer")
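
To make the "vocabulary" half of saving more concrete: BERT-style tokenizers keep their vocabulary in a plain text file with one token per line, where the line index is the token id. The sketch below only mimics that file format with the standard library. The helper names `save_vocab` and `load_vocab` and the toy token list are hypothetical, and this is not what `save_pretrained()` does internally, which also writes configuration files.

```python
import os
import tempfile


def save_vocab(tokens, path):
    """Write one token per line; the line index is the token id."""
    with open(path, "w", encoding="utf-8") as f:
        for token in tokens:
            f.write(token + "\n")


def load_vocab(path):
    """Read the file back into a token -> id mapping."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): idx for idx, line in enumerate(f)}


# Hypothetical toy vocabulary, not the real bert-base-cased one.
toy_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "beach", "##hip"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    save_vocab(toy_tokens, vocab_path)
    vocab = load_vocab(vocab_path)

print(vocab["beach"])  # 4
```

Loading the file back yields the same token-to-id mapping, which is why a saved tokenizer produces the same ids after it is reloaded.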

In [9]:
tokens = tokenizer.tokenize("As armas e os barões assinalados, na ocidental praia lusitana")

print(tokens)

['As', 'arm', '##as', 'e', 'o', '##s', 'bar', '##õ', '##es', 'ass', '##inal', '##ado', '##s', ',', 'na', 'o', '##cid', '##ental', 'p', '##rai', '##a', 'l', '##us', '##ita', '##na']


In [14]:
tokens = tokenizer.tokenize("Let's go to the beach on a spaceship with wings and cat's paws")

print(tokens)

['Let', "'", 's', 'go', 'to', 'the', 'beach', 'on', 'a', 'spaces', '##hip', 'with', 'wings', 'and', 'cat', "'", 's', 'p', '##aws']
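
The `##` pieces in the outputs above come from WordPiece-style splitting: a word that is not in the vocabulary is broken into the longest pieces that are, and every piece after the first is marked with `##`. Below is a minimal sketch of that greedy longest-match-first idea, assuming a tiny hypothetical vocabulary. The function name `wordpiece_tokenize` and `toy_vocab` are illustrative only; the real tokenizer also handles casing, punctuation, and a much larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to reproduce two splits from the output above.
toy_vocab = {"spaces", "##hip", "p", "##aws", "beach"}

print(wordpiece_tokenize("spaceship", toy_vocab))  # ['spaces', '##hip']
print(wordpiece_tokenize("paws", toy_vocab))       # ['p', '##aws']
print(wordpiece_tokenize("beach", toy_vocab))      # ['beach']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

The split depends entirely on the vocabulary: "spaceship" becomes `spaces` + `##hip` rather than `space` + `##ship` because the longest matching prefix wins.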


In [15]:
# Convert the tokens to their vocabulary ids:
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[2421, 112, 188, 1301, 1106, 1103, 4640, 1113, 170, 6966, 3157, 1114, 4743, 1105, 5855, 112, 188, 185, 19194]


In [17]:
# We are not done yet: the special tokens the model expects are still missing.
# prepare_for_model() adds them. The result matches the ids above, with 101 ([CLS]) prepended and 102 ([SEP]) appended.

final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs["input_ids"])

[101, 2421, 112, 188, 1301, 1106, 1103, 4640, 1113, 170, 6966, 3157, 1114, 4743, 1105, 5855, 112, 188, 185, 19194, 102]
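
For a single BERT sequence, the effect of this step can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: the function name `add_special_tokens` is hypothetical, and the ids 101 and 102 are taken from the outputs in this notebook for `bert-base-cased`.

```python
def add_special_tokens(ids, cls_id=101, sep_id=102):
    """Wrap a single sequence of ids as [CLS] ids [SEP] and build the companion lists."""
    input_ids = [cls_id] + list(ids) + [sep_id]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # a single segment, so all zeros
        "attention_mask": [1] * len(input_ids),  # no padding, so every position is attended
    }


ids = [2421, 112, 188, 1301]  # first four ids from the output above
encoded = add_special_tokens(ids)
print(encoded["input_ids"])  # [101, 2421, 112, 188, 1301, 102]
```

Other models use different special tokens and ids, so the defaults here only make sense for this checkpoint.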


In [6]:
# Use the decode() method to see the special tokens as text ([CLS] and [SEP] here).
# Which special tokens are added depends on the model you are using.
inputs = tokenizer("As armas e os barões assinalados, na ocidental praia lusitana")

print(tokenizer.decode(inputs['input_ids']))

[CLS] As armas e os barões assinalados, na ocidental praia lusitana [SEP]
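
Part of what decoding does is glue the `##` pieces back onto the token before them. A rough sketch of that merge step, using some of the tokens printed earlier, is shown here. The function name `merge_wordpieces` is hypothetical, and the real `decode()` does more: it starts from ids, handles special tokens, and cleans up spaces around punctuation.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, then glue ##-prefixed pieces onto the previous token."""
    return " ".join(tokens).replace(" ##", "")


tokens = ['As', 'arm', '##as', 'e', 'o', '##s', 'bar', '##õ', '##es']
print(merge_wordpieces(tokens))  # As armas e os barões
```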


In [7]:
# Another model, RoBERTa, uses a different set of special tokens that look like HTML tags (<s> and </s>):

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
inputs = tokenizer("As armas e os barões assinalados, na ocidental praia lusitana")
print(tokenizer.decode(inputs['input_ids']))


<s>As armas e os barões assinalados, na ocidental praia lusitana</s>


In [None]:
# Now that we have seen the individual steps the tokenizer performs, we can go back to calling the tokenizer object directly, which runs all of them at once.