
https://huggingface.co/docs/transformers/preprocessing

https://huggingface.co/docs/tokenizers/python/latest/quicktour.html

Quicktour \
**Build a tokenizer from scratch**

In [1]:
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
!unzip wikitext-103-raw-v1.zip

--2022-03-29 18:14:50--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 3.5.1.157
Connecting to s3.amazonaws.com (s3.amazonaws.com)|3.5.1.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 191984949 (183M) [application/zip]
Saving to: ‘wikitext-103-raw-v1.zip’


2022-03-29 18:14:55 (35.0 MB/s) - ‘wikitext-103-raw-v1.zip’ saved [191984949/191984949]

Archive:  wikitext-103-raw-v1.zip
   creating: wikitext-103-raw/
  inflating: wikitext-103-raw/wiki.test.raw  
  inflating: wikitext-103-raw/wiki.valid.raw  
  inflating: wikitext-103-raw/wiki.train.raw  


**Training the tokenizer** \
In this tour, we will build and train [a Byte-Pair Encoding (BPE) tokenizer](https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0#:~:text=%F0%9F%8F%83-,Byte%2DPair%20Encoding%20(BPE),Data%20Compression%E2%80%9D%20published%20in%201994.). For more information about the different types of tokenizers, check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in the 🤗 Transformers documentation. Here, training the tokenizer means it will learn merge rules by:

*   Starting with all the characters present in the training corpus as tokens.
*   Identifying the most common pair of tokens and merging it into one token.
*   Repeating until the vocabulary (i.e., the number of tokens) has reached the size we want.
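
The three steps above can be sketched in plain Python. This is only a toy illustration, not the library's implementation: the function name `train_toy_bpe`, the tiny word list, and the tie-breaking rule (first pair seen wins) are all assumptions made for the example.

```python
from collections import Counter


def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Step 1: every word starts as a tuple of single characters, stored with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus, replacing each occurrence of the best pair by one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        # Step 3: repeat on the rewritten corpus until the merge budget is used up.
        corpus = new_corpus
    return merges


# Made-up toy corpus: each word is repeated to give it a frequency.
toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
toy_merges = train_toy_bpe(toy_words, num_merges=3)
print(toy_merges)
```

In the real trainer, the stopping criterion is the target vocabulary size rather than a fixed number of merges; a merge budget is used here only to keep the sketch short.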


The main API of the library is the Tokenizer class. After installing the library, here is how we instantiate one with a BPE model:

In [2]:
!pip install tokenizers

Collecting tokenizers
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Installing collected packages: tokenizers
Successfully installed tokenizers-0.11.6


In [3]:
# The main API of the library is the Tokenizer class. Here is how we instantiate one with a BPE model:
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer

<tokenizers.Tokenizer at 0x55d35da98a00>

In [4]:
# To train our tokenizer on the wikitext files, we need to instantiate a trainer, in this case a BpeTrainer.
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# We can set training arguments such as vocab_size or min_frequency (here left at
# their default values of 30,000 and 0), but the most important part is to give the
# special_tokens we plan to use later on (they are not used at all during training)
# so that they get inserted in the vocabulary.
#
# Note: the order in which you write the special tokens list matters: here "[UNK]"
# will get the ID 0, "[CLS]" will get the ID 1, and so forth.

[CLS] == classification \
[SEP] == separator
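
The note about special-token order can be made concrete with a small sketch. It only mirrors the rule stated in the note (IDs follow list order, starting at 0); the name `expected_ids` is made up for this example, and the commented-out check assumes the tokenizer has already been trained.

```python
# Same list as the one passed to BpeTrainer.
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# Assumption taken from the note: special tokens receive IDs 0, 1, 2, ... in list order.
expected_ids = {token: index for index, token in enumerate(special_tokens)}
print(expected_ids)

# Once the tokenizer is trained, the mapping could be compared with the real vocabulary:
# for token, index in expected_ids.items():
#     assert tokenizer.token_to_id(token) == index
```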

We could train our tokenizer right now, but it wouldn’t be optimal. Without a pre-tokenizer that will split our inputs into words, we might get tokens that overlap several words: for instance we could get an "it is" token since those two words often appear next to each other. Using a pre-tokenizer will ensure no token is bigger than a word returned by the pre-tokenizer. Here we want to train a subword BPE tokenizer, and we will use the easiest pre-tokenizer possible by splitting on whitespace.

In [5]:
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
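
To see what such a split looks like, here is a rough pure-Python imitation. It assumes the Whitespace pre-tokenizer behaves like the regular expression `\w+|[^\w\s]+` (words, or runs of punctuation); the function name `whitespace_pre_tokenize` is hypothetical and is not part of the library.

```python
import re

# Assumption: words (\w+) or runs of non-word, non-space characters ([^\w\s]+).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs for each word or punctuation run."""
    return [(match.group(), match.span()) for match in WORD_OR_PUNCT.finditer(text)]


pieces = whitespace_pre_tokenize("Hello, y'all! it is")
print(pieces)
```

Because "it" and "is" come out as separate pieces, no merge learned during training can span both of them, which is what rules out an "it is" token.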

In [6]:
!ls

sample_data  wikitext-103-raw  wikitext-103-raw-v1.zip


In [7]:
# Now, we can just call the train() method with any list of files we want to use:
files = [f"./wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
files


['./wikitext-103-raw/wiki.test.raw',
 './wikitext-103-raw/wiki.train.raw',
 './wikitext-103-raw/wiki.valid.raw']

In [8]:
tokenizer.train(files, trainer)

Training our tokenizer on the full wikitext dataset should take only a few seconds! To save the tokenizer in one file that contains all its configuration and vocabulary, just use the save() method:

In [9]:
tokenizer.save("./tokenizer-wiki.json")

In [10]:
tokenizer = Tokenizer.from_file("./tokenizer-wiki.json")

# Using the *tokenizer*

In [11]:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")

In [12]:
print(output.tokens)

['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']


In [13]:
print(output.ids)

[27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]


An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking, meaning you can always get the part of your original sentence that corresponds to a given token. These spans are stored in the offsets attribute of our Encoding object. For instance, suppose we want to find what caused the "[UNK]" token to appear. It is the token at index 9 in the list, so we can just ask for the offsets at that index:

In [14]:
print(output.offsets[9])
# (26, 27)

(26, 27)


In [15]:
sentence = "Hello, y'all! How are you 😁 ?"
sentence[26:27]
# "😁"

'😁'

# Post-processing

We might want our tokenizer to automatically add special tokens, like "[CLS]" or "[SEP]". To do this, we use a post-processor. TemplateProcessing is the most commonly used one: you just have to specify a template for the processing of single sentences and pairs of sentences, along with the special tokens and their IDs.

When we built our tokenizer, we set "[CLS]" and "[SEP]" in positions 1 and 2 of our list of special tokens, so these should be their IDs. To double-check, we can use the token_to_id() method:

In [16]:
tokenizer.token_to_id("[SEP]")
# 2

2

Here is how we can set the post-processing to give us the traditional BERT inputs:

In [17]:
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

Let’s go over this snippet of code in more detail. First, we specify the template for single sentences: those should have the form `[CLS] $A [SEP]`, where `$A` represents our sentence.

Then, we specify the template for sentence pairs, which should have the form `[CLS] $A [SEP] $B [SEP]`, where `$A` represents the first sentence and `$B` the second one. The `:1` added in the template represents the type ID we want for each part of our input: it defaults to 0 for everything (which is why we don’t have `$A:0`), and here we set it to 1 for the tokens of the second sentence and the last `[SEP]` token.

Lastly, we specify the special tokens we used and their IDs in our tokenizer’s vocabulary.
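
As a rough mental model of how such a template is expanded, here is a small pure-Python sketch. It is not how TemplateProcessing is implemented and covers only the syntax used in this notebook (`$A`, `$B`, special tokens, and an optional `:<type id>` suffix); the function name `apply_template` is made up for this example.

```python
def apply_template(template, sequences):
    """Expand a template such as "[CLS] $A [SEP] $B:1 [SEP]:1" into tokens and type IDs.

    `sequences` maps "A" (and optionally "B") to already-tokenized sentences.
    """
    tokens, type_ids = [], []
    for piece in template.split():
        # An optional ":<n>" suffix sets the type ID; without it the type ID is 0.
        name, _, suffix = piece.partition(":")
        type_id = int(suffix) if suffix else 0
        if name.startswith("$"):
            # "$A" / "$B" stand for a whole tokenized sentence.
            expansion = sequences[name[1:]]
        else:
            # Anything else is a single special token.
            expansion = [name]
        tokens.extend(expansion)
        type_ids.extend([type_id] * len(expansion))
    return tokens, type_ids


pair_tokens, pair_type_ids = apply_template(
    "[CLS] $A [SEP] $B:1 [SEP]:1",
    {"A": ["Hello", "!"], "B": ["How", "are", "you", "?"]},
)
print(pair_tokens)
print(pair_type_ids)
```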

To check that this worked properly, let’s try to encode the same sentence as before:


In [18]:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"]

['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']


In [19]:
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print(output.tokens)
# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"]

['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']


In [20]:
print(output.type_ids)
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


In [21]:
# If you save your tokenizer with save(), the post-processor will be saved along with it.
tokenizer.save('out.json')

# Encoding multiple sentences in a batch

To get the full speed of the 🤗 Tokenizers library, it’s best to process your texts in batches using the encode_batch() method:

In [22]:
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])

The output is then a list of Encoding objects like the ones we saw before. You can process together as many texts as you like, **as long as it fits in memory**.

In [23]:
output

[Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]
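
When a corpus is too large to encode in one call, one option is to feed it to encode_batch() in chunks. The helper below is a minimal sketch of that idea; `iter_batches` and the batch size of 4 are illustrative choices, not part of the library.

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts` holding at most `batch_size` items each."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Made-up example inputs.
texts = [f"sentence number {i}" for i in range(10)]
batch_sizes = [len(batch) for batch in iter_batches(texts, batch_size=4)]
print(batch_sizes)

# With the tokenizer from this notebook, each chunk could then be encoded in turn:
# for batch in iter_batches(texts, batch_size=4):
#     encodings = tokenizer.encode_batch(batch)
```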

To process a batch of sentence pairs, pass a list of pairs to the encode_batch() method, where each pair is a two-item list holding sentence A and sentence B:

In [24]:
output = tokenizer.encode_batch(
    [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
)
output

[Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=16, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

When encoding multiple sentences, you can automatically pad the outputs to the longest sentence present by using enable_padding() with the pad_token and its ID (we can double-check the ID of the padding token with token_to_id(), as before):

In [25]:
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

In [26]:
tokenizer.token_to_id("[PAD]")

3

In [27]:
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[1].tokens)
# ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]

['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]', '[PAD]']


In [28]:
print(output[1].attention_mask)
# [1, 1, 1, 1, 1, 1, 1, 0]

[1, 1, 1, 1, 1, 1, 1, 0]
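
The relationship between padding and the attention mask can be illustrated with a short pure-Python sketch. It only mimics the default behaviour seen above (padding on the right, up to the longest sequence in the batch); the function name `pad_batch` and the token IDs are made up, with only the pad ID 3 taken from this notebook.

```python
def pad_batch(batch_ids, pad_id):
    """Right-pad every ID list to the longest length and build matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that a model should ignore.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# Made-up IDs: one sequence of 8 tokens and one of 7 tokens.
toy_batch = [[1, 10, 11, 12, 13, 14, 15, 2], [1, 20, 21, 22, 0, 23, 2]]
padded, masks = pad_batch(toy_batch, pad_id=3)
print(padded[1])
print(masks[1])
```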


# Using a pretrained tokenizer

You can load any tokenizer from the Hugging Face Hub as long as a tokenizer.json file is available in the repository.

In [29]:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Importing a pretrained tokenizer from legacy vocabulary files

You can also import a pretrained tokenizer directly, as long as you have its vocabulary file. For instance, here is how to import the classic pretrained BERT tokenizer:

In [30]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt

--2022-03-29 18:17:12--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.139.157
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.139.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘bert-base-uncased-vocab.txt’


2022-03-29 18:17:12 (2.56 MB/s) - ‘bert-base-uncased-vocab.txt’ saved [231508/231508]



In [31]:
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
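
For intuition about what such a vocabulary file contains, the sketch below reads a BERT-style vocabulary, assuming the usual layout of one token per line with the line number serving as the token ID. The helper `load_vocab` and the six-line in-memory file are made up for illustration; the real file downloaded above is much larger.

```python
import io


def load_vocab(lines):
    """Map each token to its ID, taken to be its zero-based line number."""
    return {line.rstrip("\n"): index for index, line in enumerate(lines)}


# A tiny stand-in for bert-base-uncased-vocab.txt (contents invented for the example).
toy_vocab_file = io.StringIO("[PAD]\n[UNK]\n[CLS]\n[SEP]\nhello\n##s\n")
vocab = load_vocab(toy_vocab_file)
print(vocab)

# The downloaded file could be read the same way:
# with open("bert-base-uncased-vocab.txt", encoding="utf-8") as f:
#     vocab = load_vocab(f)
```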